Abstract

Source camera identification (SCI) is an intriguing problem in digital forensics: identifying the source device of a given image. However, most existing works require sufficient training samples to ensure performance. In this work, we propose a method based on a multi-distance semi-supervised ensemble learning (multi-DS) strategy, which extends the labeled data set through multi-distance-based clustering and then calibrates the pseudo-labels through a self-correction mechanism. We then iteratively perform the calibration-appending-training process to improve the model. We design comprehensive experiments, and our model achieves satisfactory performance on public benchmark databases (i.e., Dresden, VISION, and SOCRatES).

1. Introduction

Digital images are widely used in social media and modern life. With an enormous number of user-friendly photo-editing apps available on the market, verifying the authenticity of digital images is important to prevent them from being exploited by malicious actors or subjected to malicious tampering. In digital image forensics, the most important approach is passive forensics, since reliable digital watermarks or label information are often unavailable in actual judicial forensic scenarios. Source camera identification (SCI) is an important branch of digital image forensics [1]. Differences in device types, models, individual hardware, and built-in image generation algorithms leave unique traces in the images they produce. For example, Lucas et al. [2] analyzed the noise introduced in the imaging process and took the photo-response nonuniformity (PRNU) noise as a fingerprint, proving the feasibility of mining the internal traces of an image.

Previous methods obtain high identification accuracy when training sets are sufficiently large, but insufficient training sets (i.e., few-shot settings) can significantly degrade performance. However, constructing a large labeled sample set is time-consuming, laborious, and sometimes impossible. It is therefore desirable to find a method that gives accurate identification results even when the training samples with known labels are limited, or at least provides a more reliable and convincing identification result. At present, typical methods for few-shot problems rely on data expansion, such as generating virtual samples [3, 4], data augmentation [5–7], and semi-supervised learning [8–10]. We address source camera identification under few-shot conditions with a semi-supervised learning method and propose a model applicable to actual judicial forensics and similar scenarios, ensuring a relatively reliable source camera identification accuracy.

In this work, we propose a distance-based semi-supervised ensemble learning (multi-DS) strategy to solve the camera source identification problem under few-shot sample conditions. We first perform classifying by comprehensively comparing multiple distances between unlabeled-labeled sample pairs, then mark unlabeled image samples with pseudo-labels by a sorting algorithm, calibrate the pseudo-labels by a support vector machine (SVM)-based self-correction mechanism, and append the new labeled sample to the training set. By iteratively repeating the labeling-calibrating-appending-training process, we obtain the final model with stopping conditions.

Our contributions are as follows:
(i) We propose a multi-distance semi-supervised ensemble learning (multi-DS) strategy, which integrates multiple distance indicators to cluster unlabeled samples based on few-shot labeled samples, making full use of the information in few-shot data sets.
(ii) We mark the samples close to each cluster center with the pseudo-label of that class as reliable pseudo-label samples and use them to expand the few-shot sample set for SVM self-correction training, optimizing the model through continuous iteration.
(iii) We conduct comprehensive experiments on the threshold parameters of the multi-DS strategy to prove the effectiveness of our model. The experimental results show that our strategy is superior to existing methods on few-shot data sets.

The rest of this article is structured as follows. Section 2 introduces the field of source camera identification and related work in few-shot scenarios. Section 3 details our proposed distance-based semi-supervised ensemble learning (multi-DS) strategy. Section 4 provides background on the techniques the strategy uses. Section 5 discusses the experimental design and results in detail. Section 6 concludes our work.

2. Related Work

In this work, we mainly focus on camera model identification; this section reviews existing work in the field. Over the past two decades, SCI research has developed along three directions according to the identification level: device-based, model-based, and individual-based. We identify the source at the level of the camera model.

Model-based SCI identifies the source camera model of a given image, i.e., the specific brand and model of the source camera. Since each camera brand builds its models differently, statistical image characteristics carry discriminative information. Kharrazi et al. [11] proposed that, owing to inherent structural differences between camera models, statistical characteristics such as image color correlation and color energy ratio can be used for SCI. They used wavelet features [12] and image quality features [13] to collect image information, built higher-order statistical models of natural images, and then used support vector machines to discriminate between untouched and adulterated images, thereby completing source camera identification. They also used further statistical features for identification, including color correlation, color energy ratio, and neighborhood distribution centroid. Çeliktutan et al. [14] combined three sets of forensic features, binary similarity metrics (BSM), image quality metrics (IQM), and higher-order wavelet statistics (HOWS), further demonstrating the feasibility of fusing multiple features for forensics. Swaminathan et al. [15] exploited the correlation between image pixels caused by color filter array (CFA) interpolation and proposed a linear interpolation model, based on the peaks in the spectral domain, to estimate the neighborhood CFA interpolation coefficients. Xu et al. [16] combined local binary pattern (LBP) and local phase quantization (LPQ) features of the hue and value channels in the hue, saturation, and value (HSV) color space, extracting LBP and LPQ features from the contourlet transform coefficients of the original image and its residual noise image; their method recognizes 10 camera brands from the Dresden image data set with an accuracy of 99.8%.

The work above, however, targets training environments with sufficient labeled samples; when labeled training samples are insufficient, an SVM classifier or other learning algorithm cannot be fully trained, which significantly reduces SCI classification accuracy. Tan et al. [17] constructed multiple prototype sets using the ensemble projection (EP) method and used richer features to mitigate the shortage of known labels in few-shot problems. Liang et al. [18] used an attention multisource fusion few-shot learning method (AMF-FSL) to transfer the classification ability learned from multisource data to target data, improving the cross-domain generalization ability of the classification model. Sameer and Naskar [19] used a deep Siamese network, enlarging the training space by forming sample pairs from the same and from different camera models, and obtained a better model for the source camera identification problem. Huo et al. [20] focused on mixed zero-shot and few-shot learning under extreme scarcity: they gathered additional noisy-label samples from an image search engine for data expansion and then trained on the enhanced but noisy data through noise suppression and semantic projection learning algorithms, providing a feasible scheme for practical scenarios where data are particularly inadequate.

In practice, the number of training samples available for source camera identification is often insufficient for reliable identification. Considering practical application scenarios with few-shot labeled samples, we therefore propose the following strategy.

3. The Proposed Multi-DS Strategy

To make full use of the detailed information in few-shot samples, this section proposes a few-shot image source identification strategy based on distance semi-supervised ensemble learning, named the multi-DS algorithm.

The complete multi-DS algorithm block diagram is shown in Figure 1. After extracting sufficient features for all data sets, two screenings, one based on the distance threshold and one based on the statistical frequency threshold, are performed over the three distance indicators to complete the semi-supervised ensemble learning process, and the pseudo-label information is corrected iteratively until it stabilizes. The details of each part of the algorithm follow.

The block diagram of feature extraction and multi-distance ensemble filtering of pseudo-label samples is shown in Figure 2. To fully capture the detailed information of all camera images, multiple statistical features are extracted from all training and testing samples. At the same time, to ensure the accuracy of pseudo-label assignment, we use several distance indicators to supervise the distribution of pseudo-labels: for each sample's feature vector, we calculate the Euclidean, Manhattan, and Chebyshev distances from each test sample to each training sample, obtaining multi-distance indicators. By sorting these distances, we select the nearest unlabeled samples of each training sample and assign them the pseudo-label of that training sample's class; the samples selected at least as many times as the statistics threshold are then kept as effective pseudo-label samples to complete the semi-supervised ensemble learning training.

The pseudo-code for this process is given in Algorithm 1. In the second step, we compute distance vectors under each distance indicator and sort them in ascending order to match the nearest unlabeled samples to each labeled sample. In the third step, we count how many times each unlabeled sample is selected per camera class and mark those selected at least as many times as the statistics threshold as valid pseudo-label samples, completing one round of pseudo-label filtering.

Symbols:
U: the set of unlabeled samples
Y: the set of labels
S_y: the set of labeled samples with label y
D: the collection of distance measures
T_d: the distance threshold
T_s: the statistics threshold
(1)  Extract multiple feature vectors from all training samples;
(2)  Calculate multiple distance parameters as follows:
for each label y in Y, each sample s in S_y, and each measure d in D do
  Form the distance vector v = (d(s, u) : u in U)
  Sort v in ascending order and take the first T_d entries to form a new vector v'
end for
(3)  Selecting pseudo-label samples:
  for each y in Y do
   for each u in U do
    Count the number of vectors v' (over all s in S_y and d in D) that contain u
    if the total count of u is not less than T_s then
     Label u with pseudo-label y (Note: one unlabeled sample may be marked with multiple pseudo-labels and appended to the corresponding labeled sets)
    end if
   end for
  end for
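The two-stage screening of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch: the function name, the NumPy representation, and the parameter names `t_d` and `t_s` are our own assumptions, not taken from the paper.

```python
import numpy as np

def select_pseudo_labels(unlabeled, labeled_sets, t_d=10, t_s=3):
    """Two-stage pseudo-label screening (a sketch of Algorithm 1).

    unlabeled    : (n_u, d) feature matrix of unlabeled samples
    labeled_sets : dict mapping class label -> (n_c, d) feature matrix
    t_d          : distance threshold (nearest neighbours kept per metric)
    t_s          : statistics threshold (minimum selection count)
    Returns a dict mapping class label -> list of unlabeled-sample indices.
    """
    # The three distance indicators used by multi-DS.
    metrics = {
        "euclidean": lambda a, b: np.linalg.norm(a - b, ord=2, axis=1),
        "manhattan": lambda a, b: np.linalg.norm(a - b, ord=1, axis=1),
        "chebyshev": lambda a, b: np.linalg.norm(a - b, ord=np.inf, axis=1),
    }
    pseudo = {}
    for label, samples in labeled_sets.items():
        counts = np.zeros(len(unlabeled), dtype=int)
        for x in samples:                      # each labeled sample votes...
            for dist in metrics.values():      # ...once per distance metric
                d = dist(unlabeled, x)
                # First screening: keep the t_d nearest unlabeled samples.
                nearest = np.argsort(d)[:t_d]
                counts[nearest] += 1
        # Second screening: keep samples selected at least t_s times.
        pseudo[label] = np.flatnonzero(counts >= t_s).tolist()
    return pseudo
```

Note that, as in the algorithm, an unlabeled sample may be selected for more than one class.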

To obtain as much a priori information as possible in the few-shot case, we use LBP and CFA image statistical features; the relevant details are described in Section 4.

In addition, semi-supervised ensemble learning based on three distance indicators is the focus of our work. In this process, we carry out two rounds of screening to ensure that the selected pseudo-label samples are as authentic and reliable as possible. First, we obtain enough candidate pseudo-label samples through the distance threshold; then, we keep the sufficiently reliable ones through the statistics threshold. After both thresholds, the final pseudo-label samples are close enough to the labeled samples in feature space that their pseudo-labels can be trusted. By calculating the multiple distances separately, we combine the characteristics of the distance indicators and use the idea of ensemble learning to select more reliable pseudo-label samples.

After obtaining valid pseudo-label samples, we use SVM self-correction to correct some labels. An SVM classifier trained on the few-shot samples and the selected pseudo-label samples retests all unlabeled samples, yielding new pseudo-label samples; this process repeats until the model converges, completing the SVM self-correction. Finally, we obtain the classification accuracy on the testing data sets, as shown in Figure 3. The labeled samples guide the selected pseudo-label samples during semi-supervised learning; we update the pseudo-labels iteratively from each round's SVM results and then train the SVM on the known labeled samples together with the updated pseudo-label samples to obtain the final model. Details of the relevant techniques are described in Section 4.
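The self-correction loop described above can be sketched with scikit-learn. This is a hedged sketch: the SVM kernel, its parameters, and all names below are our assumptions, since the paper does not specify them.

```python
import numpy as np
from sklearn.svm import SVC

def self_correct(X_lab, y_lab, X_pseudo, y_pseudo, max_rounds=20):
    """SVM self-correction: train on labeled + pseudo-labeled samples,
    re-predict the pseudo-labels, and stop once they no longer change.
    (Function and variable names are ours, not from the paper.)
    """
    y_pseudo = np.asarray(y_pseudo)
    clf = None
    for _ in range(max_rounds):
        clf = SVC(kernel="rbf", gamma="scale")
        clf.fit(np.vstack([X_lab, X_pseudo]),
                np.concatenate([y_lab, y_pseudo]))
        y_new = clf.predict(X_pseudo)          # retest the pseudo-labels
        if np.array_equal(y_new, y_pseudo):    # converged: labels stable
            break
        y_pseudo = y_new                       # calibrate and iterate
    return clf, y_pseudo
```

The returned classifier is the final model evaluated on the test set.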

In this process, the reliable pseudo-label samples are those screened through multiple thresholds with high confidence, though they may not always match the true labels. At the same time, the pseudo-label samples expand the training set, making the available information richer and more reliable.

With these reliable pseudo-label samples, the model training stage can draw on their information alongside the labeled samples. In this way, we effectively enrich the information in the few-shot data sets and finally complete the process of semi-supervised learning.

4. Background

4.1. Feature Extraction Algorithm

To verify the rationality of the proposed multi-DS strategy, we combine two features: LBP and CFA. The two generalize differently: the uniform-pattern LBP feature effectively reduces the noise impact of high-frequency patterns and generalizes well, while the CFA feature reflects the inherent algorithmic differences between cameras, which vary strongly across camera models. The two features thus extract intrinsic image information from two different aspects, making the extraction of camera features more comprehensive.

4.1.1. CFA Features

The CFA interpolation algorithm is a widely used, model-dependent camera characteristic. Swaminathan et al. [21] calculated the pixel interpolation of each color channel separately and estimated the CFA interpolation coefficients through a linear model. Taking the green channel of the image as an example, and assuming the selected interpolation point has coordinates (x, y) and the selected neighborhood is the (2N+1) × (2N+1) pixel area around it, the interpolation model of the interpolated pixel is

G(x, y) = Σ_{(i,j) ≠ (0,0)} [ α_{i,j} G(x+i, y+j) + β_{i,j} R(x+i, y+j) + γ_{i,j} B(x+i, y+j) ],

where the sum runs over the neighborhood offsets (i, j).

Here, α_{i,j}, β_{i,j}, and γ_{i,j} are the CFA interpolation coefficient weights of the green, red, and blue channels of the color image, respectively: α_{i,j} is the coefficient for the (i, j) neighbor of the pixel in the green channel, while β_{i,j} and γ_{i,j} represent the corresponding coefficients for the red and blue channels. Equation (2) can be abbreviated into the vector form

G = A w,

where the rows of A collect the neighborhood pixel values of each interpolated point and w stacks the coefficients α, β, and γ.

The interpolation coefficients of the red and blue channels of the color image are calculated similarly, finally yielding the interpolation coefficients on all three color channels. For a given digital image, according to the Bayer CFA structure, the G interpolation coefficients at the R and B sampling points and the R and B interpolation coefficients at the two G sampling points are estimated, respectively, giving 240 interpolation coefficients in total. The mean and variance of the 240-dimensional CFA interpolation coefficients are concatenated into a 480-dimensional CFA feature.
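As an illustration, such linear interpolation coefficients can be estimated by least squares over sliding neighborhoods. The sketch below is a simplified single-channel version that ignores the Bayer sampling geometry; the function and all names are our own, not from [21].

```python
import numpy as np

def estimate_cfa_coeffs(channel, n=2):
    """Least-squares estimate of linear interpolation weights.

    Simplifying assumption: every interior pixel is re-predicted from its
    full (2n+1)x(2n+1) neighbourhood within a single channel.
    """
    h, w = channel.shape
    k = 2 * n + 1
    rows, targets = [], []
    for y in range(n, h - n):
        for x in range(n, w - n):
            patch = channel[y - n:y + n + 1, x - n:x + n + 1].ravel().astype(float)
            center = patch[k * k // 2]
            patch = np.delete(patch, k * k // 2)   # neighbours only
            rows.append(patch)
            targets.append(center)
    A, b = np.asarray(rows), np.asarray(targets)
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)  # solve A w ~= b
    return coeffs                                   # (2n+1)^2 - 1 weights
```

In the actual feature, such coefficients are estimated per channel and per Bayer sampling pattern, and their means and variances form the 480-dimensional CFA feature.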

4.1.2. LBP Features

The LBP feature is an operator that reflects the local texture of an image and has the advantages of gray-scale invariance, rotation invariance (in its extended forms), and so on. The original LBP operator takes the gray value of the center pixel of a 3 × 3 window as the threshold; each of the other eight pixels is coded 1 if its gray value is not less than the threshold and 0 otherwise. The resulting eight-bit binary number, converted to decimal, is the LBP value of the center point:

LBP_{P,R} = Σ_{p=0}^{P−1} s(g_p − g_c) · 2^p,  where s(x) = 1 if x ≥ 0 and 0 otherwise.

Here, g_c is the gray value of the center pixel, g_p is the gray value of the p-th neighboring point, and P is the total number of neighbors on the circle of radius R around the center pixel, where P = 8. s(·) is the binarization threshold function.

Considering that 8 neighborhood pixels give 2^8 = 256 possible LBP patterns, Ojala et al. [22] proposed the uniform pattern for dimensionality reduction: since the LBP codes of most images jump between 0 and 1 at most twice, they defined a "uniform pattern" as a code with fewer than three transitions over one rotation, and all other codes as a single "mixed" (nonuniform) pattern. Merging the mixed-mode codes reduces the interference of high-frequency noise, and the feature dimension drops from 256 to 59. The 59-dimensional LBP features are then extracted from the spatial domain, the prediction-error domain, and the wavelet transform domain of the image, respectively. A color image has three channels, but the post-processing algorithms of the red and blue channel images are essentially the same, so only the LBP features of the red and green channels need to be extracted. A total of 59 × 3 × 2 = 354-dimensional LBP features are obtained.
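The uniform-pattern reduction can be sketched as follows. This is a minimal NumPy sketch of the 8-neighbor, radius-1 operator with the 59-bin uniform mapping; the function name and layout are our own assumptions.

```python
import numpy as np

def uniform_lbp_hist(img):
    """59-bin uniform-pattern LBP histogram of an 8-bit grayscale image."""
    # Offsets of the 8 neighbours, in circular order around the center.
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    center = img[1:-1, 1:-1]
    codes = np.zeros_like(center, dtype=int)
    for p, (dy, dx) in enumerate(offs):
        neigh = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        codes |= (neigh >= center).astype(int) << p   # threshold and pack bits
    # Map each 8-bit code to one of 58 uniform patterns or a single
    # "mixed" bin; uniform = at most two 0/1 transitions on the circle.
    table = np.full(256, 58)          # bin 58 collects non-uniform codes
    u = 0
    for c in range(256):
        ring = [(c >> i) & 1 for i in range(8)]
        trans = sum(ring[i] != ring[(i + 1) % 8] for i in range(8))
        if trans <= 2:
            table[c] = u              # 58 uniform codes -> bins 0..57
            u += 1
    return np.bincount(table[codes].ravel(), minlength=59)
```

In the paper's setting, such a 59-bin histogram would be computed per domain (spatial, prediction-error, wavelet) and per channel (red, green).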

4.2. Multiple Distance Indicators

Since the essence of a norm is a distance, and its purpose is comparison, we can use several norms as distance indicators. Here, we use the 1-norm, the 2-norm, and the infinity-norm for comprehensive judgment; they correspond to the Manhattan distance (MD), the Euclidean distance (ED), and the Chebyshev distance (CD), respectively.

The three distances measure the dissimilarity between image feature samples. For two points (x_1, y_1) and (x_2, y_2) in two-dimensional space,

ED = sqrt((x_1 − x_2)^2 + (y_1 − y_2)^2),
MD = |x_1 − x_2| + |y_1 − y_2|,
CD = max(|x_1 − x_2|, |y_1 − y_2|).
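For concreteness, the three indicators can be computed with NumPy norms; the vectors below are made up for illustration.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

ed = np.linalg.norm(a - b, ord=2)       # Euclidean: sqrt(9 + 16 + 0) = 5.0
md = np.linalg.norm(a - b, ord=1)       # Manhattan: 3 + 4 + 0 = 7.0
cd = np.linalg.norm(a - b, ord=np.inf)  # Chebyshev: max(3, 4, 0) = 4.0
```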

4.3. Semi-Supervised Learning

Semi-supervised learning (SSL) targets the failure of supervised algorithms when labeled data are scarce. Its characteristic is that it introduces no external information: unlabeled samples are brought in with the help of the few labeled samples and certain criteria, so as to exploit the information of unlabeled samples as much as possible. The finite sample set is thereby expanded, and a better model can be trained.

4.4. The Idea of Ensemble Learning

The idea of ensemble learning is to combine multiple weak models into a stronger, more comprehensive one. Common combination methods are averaging, voting, and learning-based combination.

By combining the different characteristics of multiple distance indicators, we use the idea of ensemble learning to select pseudo-label samples that lie closer to the labeled samples in feature space. The complementarity of the distance indicators makes it possible to improve the reliability of the selected pseudo-label samples by integrating them.

4.5. The SVM Self-Correction Classifier

The goal of the SVM classifier is to find a hyperplane that separates two classes of data with the maximum margin between the samples; multi-class classification can be realized by reusing binary SVM classifiers. The hyperplane w^T x + b = 0 divides the n-dimensional feature space into two parts, and the support vectors are the sample points closest to it. The geometric margin is usually used as the distance measure:

γ_i = y_i (w^T x_i + b) / ||w||.

Here, y_i is the class label of data point x_i, and w and b are the parameters of the hyperplane w^T x + b = 0. Through continuous iterative training, the optimal hyperplane with the largest margin is found; this iterative process is the training of the SVM classifier.
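As a toy numerical illustration of the geometric margin, consider a made-up hyperplane and two labeled points (all values below are our own, for illustration only):

```python
import numpy as np

# Hyperplane w.x + b = 0 with w = (1, 1), b = -3.
w = np.array([1.0, 1.0])
b = -3.0
X = np.array([[1.0, 1.0], [4.0, 4.0]])
y = np.array([-1, 1])                   # class labels in {-1, +1}

# Geometric margin of each point: y_i * (w.x_i + b) / ||w||.
margins = y * (X @ w + b) / np.linalg.norm(w)
# point 1: -1 * (2 - 3) / sqrt(2) = 1/sqrt(2); point 2: (8 - 3) / sqrt(2)
```

Both margins are positive, meaning both points lie on the correct side of the hyperplane.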

In our strategy, the SVM classifier classifies the feature vectors of camera photographs. After identifying the selected pseudo-label samples, we feed the few-shot labeled samples and the selected pseudo-label samples, with their labels and multiple feature vectors, into the SVM for training. We then retest the pseudo-label samples, continually iterating and updating their labels to complete the SVM self-correction training process.

5. Experimental Results and Analysis

5.1. Experimental Image Data Sets and Settings

To fully verify the effectiveness of distance-based semi-supervised ensemble learning, all experiments in this study use public databases from the field of image forensics: the Dresden, VISION, and SOCRatES databases, which are popular image data sets in forensic research.

In this experiment, we select 16 classes of devices from the Dresden database [23], 10 from the VISION database [24], and 10 from the SOCRatES database [25]; the specific device information is shown in Tables 1–3. The number of training samples is limited to 5, 10, 15, 20, or 25 images per class. The test set consists of 150 unlabeled samples per class, i.e., 2400 (Dresden) and 1500 (VISION and SOCRatES) image samples in total. Based on our dynamic threshold selection strategy, a varying number of pseudo-label samples per class (from 25 to 55) is selected from these unlabeled samples, and the training set is expanded with them as training samples. Each reported result is a stable statistic obtained by averaging over 10 random training runs.

5.2. Algorithm Performance Evaluation and Analysis

In this study, the experiment using only LBP features with an SVM is labeled LBP-SVM, the one using only CFA features is labeled CFA-SVM, and the one using both features is labeled multi-SVM. Likewise, for the DS method, the variant using only LBP features is called LBP-DS, the one using only CFA features is called CFA-DS, and the final method integrating both features is called multi-DS.

In this experiment, the number of training samples per class is 5, 10, 15, 20, or 25, simulating different amounts of training data, and the strategy has two thresholds. We conducted detailed experiments on each data set.

For the pseudo-label distance threshold, we tested the accuracy with 5, 10, 15, 20, and 25 training samples per class. The threshold ranges from 5 to 50 and represents the number of unlabeled samples per class selected under the multi-distance indicators. After testing the distance threshold across the different sample counts, we finally set it to 10, as shown in Figure 4.

For the statistics threshold, the experimental results are shown in Tables 4–6. With 5, 10, 15, 20, and 25 training samples per class, we test the accuracy for different values of the statistics threshold. For each labeled sample, we select its nearest unlabeled samples under the three distance indicators; based on how many times each unlabeled sample is selected, we set, for each camera class, the minimum number of selections required for an unlabeled sample to be marked as a valid pseudo-label sample, i.e., the statistics threshold. For the different numbers of training samples, we choose thresholds of 3, 5, 6, 7, and 8, respectively. These values were chosen because raising the threshold further leaves some classes with no selected unlabeled samples, while lowering it admits more samples and reduces the purity of the selected sample set.

In this study, 10 random experiments were carried out on each database and the results averaged. The final results on the three databases are shown in Figures 5–7, and the confusion matrices between classes are shown in Tables 7–9.

As shown in Figure 5, on the Dresden data set with 5 training samples per class, the accuracy of the CFA-DS method is 18.6% higher than that of the CFA-SVM method, which fully reflects that CFA-DS performs much better than CFA-SVM in the few-shot case. For LBP features, however, the performance of LBP-DS is not satisfactory. This study therefore combines LBP and CFA features with the idea of ensemble learning, i.e., the multi-DS method, and carries out subsequent experiments on the other data sets.

Table 7 shows the confusion matrix obtained by repeating the experiment 10 times with 25 training samples per class for the multi-DS method; accuracies below 0.1% are marked "—." The confusion matrix shows that the accuracies of the Sony_DSC-H50 (SD1) and Sony_DSC-W170 (SD3) models are 66.5% and 47.4%, respectively, and these two models are confused with each other the most. Consulting the literature, we find the same confusion reported in many previous works, consistent with the results of the strategy proposed in this study: the two models adopt similar image post-processing algorithms, so the difference between them is very small.

Across the experimental data sets, the performance improvement is more obvious when training samples are fewer. The three databases show similar results, indicating that the proposed strategy is a good solution to the few-shot problem. The experimental results thus reflect the wide applicability and universality of the proposed strategy and help address judicial forensic cases with insufficient known samples.

At the same time, to validate the stability of the model, we conduct a set of stability experiments. When the number of samples per class is very small, randomly selected samples may not represent the whole population. Therefore, with only 5 few-shot samples per class, we run 20 experiments and average the results, as shown in Table 10, with the maximum and minimum values in each data set shown in bold. The experimental results show that our model remains stable even with few-shot samples.

In addition, we compare the proposed multi-DS strategy with other existing methods to verify its performance. We reselected 14 (Dresden), 11 (VISION), and 10 (SOCRatES) camera models, with 10 labeled samples per class. These reselected camera classes match the class settings of the data set in [19] to ensure a fair performance evaluation. The comparison with existing methods is shown in Table 11; under the same experimental setup, our results outperform the other existing methods.

In addition, we examine the influence of the number of classes on classification accuracy, varying the independent variable from a model-level subset of each data set to the complete data set: from 14 to 27 camera model classes in the Dresden database (the whole data set) and from 11 to 35 in the VISION database (the whole data set). The results, verified on the Dresden and VISION databases in Figures 8 and 9, show that classification accuracy gradually decreases as the number of models increases, which is consistent with [19].

6. Conclusion

In this study, we address the problem of few-shot classification in source camera identification. Given the practical significance of the few-shot scenario, handling this challenge effectively is very important. We therefore propose the multi-DS strategy: ensemble learning over several distance indicators, together with SVM self-correction, extracts information from unlabeled samples and effectively supplements the available information. Our experimental results show that this strategy can effectively improve image source identification accuracy on few-shot data sets and provides a good solution for actual judicial forensic problems with an insufficient number of known samples. In future work, source camera identification with even fewer samples, or in extreme cases, deserves further attention and discussion.

Data Availability

The data sets used or analyzed in this study are published data sets. The Dresden data set is unavailable at present, possibly because access is restricted to protect privacy; the Dresden data used to support the findings of this study are available from the corresponding author upon request. The data set is described in: Gloe, Thomas, and Rainer Böhme, "The 'Dresden Image Database' for benchmarking digital image forensics," Proceedings of the 2010 ACM Symposium on Applied Computing, 2010, pp. 1584–1590. The VISION data set is available at https://lesc.dinfo.unifi.it/en/datasets/ and the SOCRatES data set at http://socrates.eurecom.fr/

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Nos. U1936117, 62106037, and 62076052), the Science and Technology Innovation Foundation of Dalian (No. 2021JJ12GX018), the Open Project Program of the National Laboratory of Pattern Recognition (NLPR) (No. 202100032), and the Fundamental Research Funds for the Central Universities (No. DUT21GF303).