Abstract

This article proposes a novel classification framework that classifies the samples of multiple domains based on the outputs of multiple models. Different from the existing methods that train a single model on all domains, our framework trains multiple models on each domain. On a testing sample, the outputs of all trained models are used to predict the domain of this sample. Then, this sample is classified by the outputs of the models that belong to the predicted domain. Experiments show that our framework achieves higher accuracy than the existing methods. Furthermore, our framework achieves good scalability over multiple domains.

1. Introduction

Nowadays, deep learning models can achieve good performance in many applications [1–5]. Generally, the performance of a deep learning model depends on the captured features [6–10]. Supporting more domains is important to the scalability of a deep learning system, while it increases the difficulty of training high-performance models. Figure 1 introduces an example that includes multiple domains. We use a dataset to represent the collected samples of the corresponding domain. These datasets may have different numbers of labels and various samples, which increases the difficulty of training high-performance models. Transfer learning [11–13] can temporarily solve this problem, while the performance is still limited by the structure of the models. As Figure 1(a) shows, a bigger model (deeper structure and more layers) is a general solution, as it can capture more features; however, it is limited by the available computational resources and may cause the vanishing gradient problem [14–16]. Thus, there should be another way to improve the performance of a deep learning system on multiple domains, such as using multiple models.

Some fusion methods [17–19] can utilize multiple models to increase the classification accuracy, where the performance depends on the selection of highly accurate trained models. Compared with training a single model on multiple domains, training each model on its corresponding domain can be a good solution, as it achieves good scalability, as we show in Figure 1(b). Generally, such a trained model easily ensures high accuracy on its corresponding domain. On the contrary, this model may have low accuracy on the other domains, which reduces the performance of these fusion methods. Thus, on a testing sample, the prediction of the domain is important to the fusion methods.

In this article, we build a novel framework (CMS-CMM, the Classification of Multi-Domain Samples Based on the Cooperation of Multiple Models) to increase the accuracy of classification on the samples of multiple domains. Our contributions can be summarized as follows. (1) We build a novel framework that achieves the scalability of the deep learning system. As the number of domains increases, the difficulty of transfer learning increases, as it has to consider the performance of all domains. On the contrary, our framework only needs to train some deep learning models on the training set of a new domain, which benefits scalability. (2) Our framework increases the accuracy of classification without enlarging the structure of the models. Generally, a bigger model increases the classification accuracy while it needs more memory space, which is impossible to satisfy in some applications. Instead, our framework only increases the number of models, where each of them consumes less memory than a big model.

The rest of the article is organized as follows: Section 1.1 introduces the existing methods and their problems. In Section 2, we present our framework and related analyses. The experiments are presented in Section 3. Section 4 gives the conclusion and future work.

1.1. Related Works

VoVNet-57 (Variety of View Network) is designed for the object classification task; it consists of a stem block of 3 convolution layers and 4 stages of modules, with an output stride of 32 [20]. The sample is passed through convolutional layers whose filters have a small receptive field. ResNeSt (residual networks with split attention) is a state-of-the-art deep learning model for image classification that uses a modular structure with a split-attention block and applies an attention mechanism to feature map groups [21]. From ResNeSt50 to ResNeSt269, the structure becomes bigger and more complicated, so these variants can achieve higher accuracy, especially when there are more and bigger training samples. Based on the size of the testing samples and the computational resources, we use ResNeSt101 in this article. RepVGG (re-parameterization visual geometry group) is a classification model that improves on the basis of the existing models [22]. DenseNet (densely connected convolutional network) is a convolutional neural network with dense connections [23]. In this network, there is a direct connection between any two layers, which means the input of each layer is connected to all the previous layers. VGG16 is a variant of the VGG (visual geometry group) models for image classification [24]. ResNet (residual neural network) allows the original input information to be routed directly to the output, which simplifies the process and reduces the difficulty of training [25].

Some fusion methods that apply multiple models have been used to improve the performance of classification [17]. In that article, the weighted voting method achieved the highest accuracy among all the others. Weighted voting is also utilized to construct a more reliable classification system [18]; a sliding window is applied to the weighted majority voting algorithm in that article. The fusion method of [19] is applied to a DNN (deep neural network), a CNN (convolutional neural network), and an LSTM (long short-term memory) network to improve the performance. These methods combine the results of models to improve the accuracy. As the weights play an important role in the combination, a validation set is needed to compute these weights. Furthermore, more varied models can benefit the improvement of the accuracy. In this article, we also apply these fusion methods, with some optimizations, to our system for higher accuracy.

When using these methods, the performance of each model is important. Training a model on a single domain can ensure high accuracy on this domain, while it may cause low accuracy on the other domains. At the same time, training a model on multiple domains may reduce the accuracy on each domain. Thus, our framework tries to solve this problem, as introduced in the next section.

2. Our Framework

Before giving the details of our framework, we give the following definitions, which are used to explain the implementation of the methods.

2.1. Preliminaries

We set $s$ as a sample and $l$ as the label of an object. We set $l_g(s)$ as the ground truth on $s$, where $l_g(s)$ belongs to the label set [26, 27]. The label is generally a number, so as to benefit the computation [28, 29]. For example, when there are 10 objects to be classified, the labels are from 0 to 9.

2.2. The Illustration of Our Framework

Figure 2 illustrates our framework, which is named CMS-CMM. In the first step, our framework trains some existing deep learning models (like deep learning models 0, 1, and 2 in this figure) on each domain (like domains 0 and 1 in this figure). Then, on a testing sample, each model outputs the probability of the labels. Firstly, based on the difference in these probabilities, we can predict the domain of this sample (illustrated by the chart). Secondly, we select the trained models of the predicted domain. Then, we can use the outputs of these models to predict the label of this sample (illustrated by the chart).

2.3. Training the Models and Outputting the Probability of Labels

We select a deep learning model $M$. Then, we train $M$ on a domain $D$ to get a trained model $M_D$. We define $P_{M_D}(l, s)$ as the probability of label $l$ on the sample $s$, which is output by the trained model $M_D$. Generally, the most probable result is selected by the following equation:

$$l_p(s) = \arg\max_{l} P_{M_D}(l, s) \quad (1)$$

which is used as the predicted result.
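As a minimal sketch of equation (1), assuming each trained model exposes a softmax probability vector (the function name and example values are illustrative, not taken from the paper's implementation):

```python
import numpy as np

def predict_label(probs: np.ndarray) -> int:
    """Equation (1): pick the label with the highest probability.

    probs -- the softmax output of a trained model M_D on a sample s,
             i.e. probs[l] = P_{M_D}(l, s).
    """
    return int(np.argmax(probs))

# Example: a model trained on a 10-label domain outputs these probabilities.
probs = np.array([0.02, 0.81, 0.03, 0.02, 0.02, 0.03, 0.02, 0.02, 0.02, 0.01])
print(predict_label(probs))  # -> 1
```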

2.4. Predicting the Domain

In our framework, we firstly select some existing deep learning models $M_0, M_1, \ldots, M_{n-1}$. Then, we train these models on each domain $D_j$ to get a set of trained models $M_{i,D_j}$. When we assume a sample $s$ belongs to a domain $D_j$, we can get a probability of labels $P_{M_{0,D_j}}(l, s)$ by the model $M_{0,D_j}$. Then, by another model $M_{i,D_j}$, we can also get $P_{M_{i,D_j}}(l, s)$. We define the difference between the model $M_{0,D_j}$ and the model $M_{i,D_j}$ on a sample $s$ as follows:

$$d(M_{0,D_j}, M_{i,D_j}, s) = \sum_{l} \left| P_{M_{0,D_j}}(l, s) - P_{M_{i,D_j}}(l, s) \right| \quad (2)$$

We can then define the difference between $M_{0,D_j}$ and the other trained models of $D_j$ on a sample $s$ as follows:

$$d_{D_j}(s) = \sum_{i=1}^{n-1} d(M_{0,D_j}, M_{i,D_j}, s) \quad (3)$$

Generally, we can select $M_0$ as the model that achieves the highest accuracy on the validation set. Thus, (3) presents the difference between the most accurate model and the other ones. Then, we can select the predicted domain $D_p$ of sample $s$ as below:

$$D_p(s) = \arg\min_{D_j} d_{D_j}(s) \quad (4)$$
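Equations (2)–(4) can be sketched as below, assuming an L1 distance between probability vectors (this choice and all names are illustrative):

```python
import numpy as np

def model_difference(p0: np.ndarray, pi: np.ndarray) -> float:
    """Equation (2): difference between two trained models of one domain,
    taken as the L1 distance between their probability vectors."""
    return float(np.abs(p0 - pi).sum())

def domain_difference(outputs: list) -> float:
    """Equation (3): total difference between the most accurate model
    (outputs[0], selected on the validation set) and the other models."""
    return sum(model_difference(outputs[0], pi) for pi in outputs[1:])

def predict_domain(per_domain_outputs: dict) -> str:
    """Equation (4): the predicted domain is the one whose trained models
    disagree the least on this sample."""
    return min(per_domain_outputs,
               key=lambda d: domain_difference(per_domain_outputs[d]))

# Example: three models per domain; the "cifar10" models agree on this
# sample while the "eurosat" models do not, so "cifar10" is predicted.
outputs = {
    "cifar10": [np.array([0.80, 0.10, 0.10]),
                np.array([0.75, 0.15, 0.10]),
                np.array([0.85, 0.10, 0.05])],
    "eurosat": [np.array([0.50, 0.30, 0.20]),
                np.array([0.10, 0.70, 0.20]),
                np.array([0.30, 0.20, 0.50])],
}
print(predict_domain(outputs))  # -> "cifar10"
```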

Figure 3 uses an example to explain how to predict the domain. We assume that the domain $D_0$ contains samples of dog and there are three models trained on this domain, which are $M_{0,D_0}$, $M_{1,D_0}$, and $M_{2,D_0}$. These models can well capture the key features of such a sample, which leads to the probability of "dog" being high and those of the other labels being low. We assume the domain $D_1$ does not contain samples of dog. Then, we can also get the corresponding three trained models $M_{0,D_1}$, $M_{1,D_1}$, and $M_{2,D_1}$ of this domain. As these models have not captured the features of dog on their training set, they may capture noisy features of dog (also included by other labels), which causes a big difference between the outputs of these models.

2.5. Predicting the Label

Once our framework has predicted the domain $D_p$ of a sample, we can use the corresponding models that are trained on this domain to predict the label of this sample. To increase the accuracy of the prediction, our framework uses the fusion method [19], a weighted model average, as follows:

$$l_p(s) = \arg\max_{l} \sum_{i=0}^{n-1} w_i \, P_{M_{i,D_p}}(l, s) \quad (5)$$

where $w_i$ presents the weight that is applied to the output of model $M_{i,D_p}$. By using these weights, the output of a more accurate model plays a more important role in the final result. We can compute the weights $w_i$ by using the validation set. We name our framework with this optimization CMS-CMM from now on.
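A sketch of the weighted fusion of equation (5); the weighting scheme here (normalized validation accuracies) is one plausible choice computed on the validation set, not prescribed by the text:

```python
import numpy as np

def fuse_and_predict(probs: list, weights: list) -> int:
    """Equation (5): weighted model average over the predicted domain's
    models, followed by an argmax over the combined probabilities."""
    combined = sum(w * p for w, p in zip(weights, probs))
    return int(np.argmax(combined))

# Example: the most accurate model (weight 0.5) pulls the decision to label 2.
probs = [np.array([0.1, 0.3, 0.6]),
         np.array([0.2, 0.5, 0.3]),
         np.array([0.3, 0.4, 0.3])]
weights = [0.5, 0.3, 0.2]  # e.g. normalized validation accuracies
print(fuse_and_predict(probs, weights))  # -> 2
```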

2.6. Optimization by the Distribution of the Labels

There are two cases that may cause a wrong prediction of the domain. Figure 4 introduces the two cases. We assume that domain 0 has 100 labels and domain 1 has 10 labels. In the general setting, the trained models of domain 0 output the probability of 100 labels, while the trained models of domain 1 output the probability of 10 labels. In Figure 4(a), we input a testing sample of domain 0 into the trained models. The difference between the trained models of domain 1 may occasionally be lower than that between the trained models of domain 0, because the range of error labels is reduced. Especially when the accuracy of the models is low, this case easily causes a wrong prediction of the domain. In Figure 4(b), we input a testing sample of domain 1 into the trained models. The difference between the trained models of domain 0 may occasionally be lower than that between the trained models of domain 1, as the error labels are scattered over a wider range. Especially when the accuracy of the models is high, this case also easily causes a wrong prediction of the domain.

To solve this problem, we make all of the trained models predict the same number of labels. For example, the trained models on $D_0$ (100 labels) and $D_1$ (10 labels) can both predict 100 labels, which is the maximum number of labels among these domains. Then, when the testing samples belong to $D_1$, we only consider labels 0 to 9 as the possible correct ones. Thus, we revise equation (4) as follows when there are different numbers of labels between domains:

$$D_p(s) = \arg\min_{D_j} \sum_{i=1}^{n-1} \sum_{l \in L_{D_j} \cup L_r} \left| P_{M_{0,D_j}}(l, s) - P_{M_{i,D_j}}(l, s) \right| \quad (6)$$

where $L_{D_j}$ is the set of possible correct labels of the corresponding domain $D_j$ and $L_r$ is the set of remaining labels. For example, when the domain $D_0$ contains 100 labels and the domain $D_1$ contains 10 labels, we can set $L_r$ to labels 10 to 99 for $D_1$. Thus, all models of these domains output the probability of the same number of labels. We can tune this setting by using the validation set. We name our framework with this optimization CMS-CMM-opt from now on.
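The label-space alignment can be sketched as below; zero-padding is an illustrative stand-in for configuring every classifier head with the common number of output units, and all names are ours:

```python
import numpy as np

NUM_COMMON_LABELS = 100  # the maximum label count among all domains

def pad_output(probs: np.ndarray) -> np.ndarray:
    """Align a model's output to the common label space, so that the
    differences of equation (6) are computed over vectors of equal size."""
    padded = np.zeros(NUM_COMMON_LABELS)
    padded[:len(probs)] = probs
    return padded

def predict_label_in_domain(combined: np.ndarray, valid_labels: range) -> int:
    """After the domain is predicted, only that domain's possible correct
    labels L_D are considered; the remaining labels L_r are masked out."""
    masked = np.full_like(combined, -np.inf)
    masked[list(valid_labels)] = combined[list(valid_labels)]
    return int(np.argmax(masked))

# Example: a 10-label domain uses labels 0 to 9 out of the common 100.
combined = pad_output(np.array([0.10, 0.60, 0.05, 0.05, 0.05,
                                0.05, 0.02, 0.03, 0.03, 0.02]))
print(predict_label_in_domain(combined, range(0, 10)))  # -> 1
```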

3. Experiment

We evaluate our methods against the existing ones on some real datasets. When parameters are randomized, we repeat the evaluation 1000 times. We trained the deep learning models (VoVNet-57 [20], ResNeSt50 [21], RepVGG [22], DenseNet [23], VGG16 [24], and ResNet [25]) on some real datasets with the reported default settings of these models. We set the number of epochs [30, 31] to 10 for all these models on any training set. We do not focus on designing the structure or tuning the hyper-parameters. Instead, we focus on how to use multiple models to achieve scalability and ensure high accuracy at the same time. We set a random number of validation samples, from 500 to 800.

3.1. Introduction of the Datasets

CIFAR-10 [32, 33] has 50000 training samples and 10000 testing samples that belong to 10 labels. We use the 50000 training samples to train the models. Then, we have 10000 samples left for validation and testing. The CIFAR-100 dataset is just like CIFAR-10, except that it has 100 classes containing 600 images each [34, 35]. There are 500 training images and 100 testing images per label. We use the 50000 training samples to train the models. Then, we have 10000 samples left for validation and testing. The Mini-ImageNet [36, 37] dataset is for few-shot learning evaluation. Its complexity is high due to the use of ImageNet images, but it requires fewer resources and less infrastructure than running on the full ImageNet dataset. We use 48000 training samples to train the models. Then, we have 12000 samples left for validation and testing. The EuroSAT [38, 39] dataset is based on satellite images and consists of 10 classes with 27000 labelled samples. We use 21600 samples for training and 5400 for testing. The Intel Image Classification [40] dataset contains natural scenes from around the world. There are around 14 k images for training, 3 k for testing, and 7 k for prediction (without labels).

3.2. Introduction of the Evaluation Metrics

We introduce some metrics to compare the methods in different dimensions on the testing samples. We assume a sample $s$ belongs to the domain $D_s$ and the corresponding ground truth label is $l_g(s)$. We define $l_p(s)$ as the predicted label and $D_p(s)$ as the predicted domain output by a method. Then, we can define the following evaluation metrics over the set $S$ of $N$ testing samples.

CD presents the accuracy of predicting correct domains as below:

$$\mathrm{CD} = \frac{|\{s \in S : D_p(s) = D_s\}|}{N} \quad (7)$$

where a higher value is better. CDCL presents the accuracy of predicting correct domains and correct labels as below:

$$\mathrm{CDCL} = \frac{|\{s \in S : D_p(s) = D_s \wedge l_p(s) = l_g(s)\}|}{N} \quad (8)$$

where a higher value is better. CDWL presents the percentage of predicting correct domains and wrong labels as below:

$$\mathrm{CDWL} = \frac{|\{s \in S : D_p(s) = D_s \wedge l_p(s) \neq l_g(s)\}|}{N} \quad (9)$$

where a lower value is better.

WD presents the percentage of predicting wrong domains as below:

$$\mathrm{WD} = \frac{|\{s \in S : D_p(s) \neq D_s\}|}{N} \quad (10)$$

where a lower value is better. WDCL presents the percentage of predicting wrong domains and correct labels as below:

$$\mathrm{WDCL} = \frac{|\{s \in S : D_p(s) \neq D_s \wedge l_p(s) = l_g(s)\}|}{N} \quad (11)$$

where a lower value is better. When the prediction of the domain goes wrong, the prediction of labels is meaningless, as the labels of different datasets indicate different kinds of objects. WDWL presents the percentage of predicting wrong domains and wrong labels as below:

$$\mathrm{WDWL} = \frac{|\{s \in S : D_p(s) \neq D_s \wedge l_p(s) \neq l_g(s)\}|}{N} \quad (12)$$

where a lower value is better.
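All six metrics can be computed in one pass over per-sample predictions; a minimal sketch (array names are ours):

```python
import numpy as np

def evaluate(pred_domain, pred_label, true_domain, true_label) -> dict:
    """Compute the six metrics of Section 3.2 from parallel arrays of
    per-sample predicted/ground-truth domains and labels."""
    pd, pl = np.asarray(pred_domain), np.asarray(pred_label)
    td, tl = np.asarray(true_domain), np.asarray(true_label)
    d_ok = pd == td  # correct domain per sample
    l_ok = pl == tl  # correct label index per sample
    return {
        "CD":   d_ok.mean(),             # higher is better
        "CDCL": (d_ok & l_ok).mean(),    # higher is better
        "CDWL": (d_ok & ~l_ok).mean(),   # lower is better
        "WD":   (~d_ok).mean(),          # lower is better
        "WDCL": (~d_ok & l_ok).mean(),   # lower is better
        "WDWL": (~d_ok & ~l_ok).mean(),  # lower is better
    }

# Example on four samples: three correct domains, two of which also have
# correct labels, so CD = 0.75 and CDCL = 0.5.
print(evaluate([0, 0, 1, 1], [3, 5, 9, 9], [0, 0, 1, 0], [3, 4, 9, 9]))
```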

3.3. Evaluation of Domain Prediction

We do not use additional information (like the resolution or size of a sample) to predict the correct domain of samples. In Table 1, the Maximum appeared method predicts the domain by the most frequently appearing label. In more detail, we select the result that appears the maximum number of times among the trained models of each domain. Among all these results, we select the one that appears the maximum number of times and set the corresponding domain as the predicted domain. In the same way, the fusion method [19] predicts the domain by the maximum value of the weighted probabilities.
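Our reading of the Maximum appeared baseline can be sketched as follows (the tie-breaking and all names are assumptions, since the description leaves them open):

```python
from collections import Counter
import numpy as np

def maximum_appeared_domain(per_domain_outputs: dict) -> str:
    """Within each domain, take the label predicted most often by that
    domain's models; across domains, pick the domain whose winning label
    appeared the most times."""
    best_domain, best_count = None, -1
    for domain, outputs in per_domain_outputs.items():
        votes = Counter(int(np.argmax(p)) for p in outputs)
        _, count = votes.most_common(1)[0]
        if count > best_count:
            best_domain, best_count = domain, count
    return best_domain

# Example: all three "cifar10" models vote for label 0 (3 votes), while the
# "eurosat" models split their votes, so "cifar10" is predicted.
outputs = {
    "cifar10": [np.array([0.8, 0.2]), np.array([0.6, 0.4]), np.array([0.7, 0.3])],
    "eurosat": [np.array([0.5, 0.5]), np.array([0.1, 0.9]), np.array([0.9, 0.1])],
}
print(maximum_appeared_domain(outputs))  # -> "cifar10"
```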

We used CD (the accuracy of predicting correct domains) to evaluate the methods. As we can see in Table 1, our CMS-CMM-opt achieves higher accuracy than the existing methods, 16.62% higher on average. Furthermore, CMS-CMM-opt achieves higher accuracy than our CMS-CMM, which demonstrates the effectiveness of the optimization.

3.4. Evaluation of Label Classification

Our final objective is to classify the samples. Thus, after the prediction of the domain, the classification of the samples must follow. On a testing sample, only when a method correctly predicts the domain and the label at the same time do we admit that this method correctly outputs the result. For example, label 9 of CIFAR-10 and label 9 of CIFAR-100 mean different kinds of objects.

We used CDCL (the accuracy of predicting correct domains and correct labels) to evaluate the methods. As we can see in Table 2, our CMS-CMM-opt achieves higher accuracy than the existing methods, 14.01% higher on average. Compared with the domain prediction, the increase in accuracy drops from 16.62% to 14.01% because there are also errors when predicting the labels. CMS-CMM uses the fusion method [19] for the prediction of labels after the domain prediction. We can see that CMS-CMM-opt also achieves higher accuracy than CMS-CMM, 11.25% higher on average.

3.5. Evaluation of the Scalability

In this subsection, we study the scalability of our framework based on the metric of CDCL (the accuracy of predicting correct domains and correct labels). We added the domains one by one and computed the label classification accuracy, as Table 3 shows.

As we can see in Table 3, the accuracy on CIFAR-10 remains the same as the number of domains grows. On the other side, the accuracies on CIFAR-100 and Mini-ImageNet become lower. The accuracy of each model plays an important role in the classification accuracy. The other important factor for the accuracy is the similarity between domains, which is introduced in the next subsection.

3.6. Impact between Domains

We can analyse the impact of a domain on the other domains, as Table 4 shows. In this table, we compare the CDCL (the accuracy of predicting correct domains and correct labels) of 5 domains with that of 4 domains, which means we drop one domain to evaluate the relation between this domain and the other ones. When we drop CIFAR-10, we find that the accuracy on CIFAR-100 increases more than that on the others. In the same way, we can find relations between other domains. When there are similar labels between domains, the prediction of the domain and labels may easily go wrong. For example, the label "fox" of CIFAR-100 is similar to the label "white fox" of Mini-ImageNet. Thus, how to consider the similarity between datasets is important to increasing the accuracy.

3.7. Evaluation on the Number of Models

In this subsection, we evaluate the relation between the number of models and CDCL (the accuracy of predicting correct domains and correct labels). We vary the number of models from 2 to 6. As there may be different combinations of models, we evaluate the average accuracy of these combinations. As we can see in Figure 5, the accuracy on all datasets increases as the number of models becomes bigger. On the other side, when the number of models is 6, the accuracy on some datasets becomes lower than that with 5 models. When there are low-accuracy models, these may lower the classification accuracy of our framework. The proper number of models can be determined by the validation set.

3.8. Evaluation by More Metrics on All Testing Samples

As we can see in Table 5, our methods achieve better performance on most of the metrics. In the CDWL (the percentage of predicting correct domains and wrong labels) case, the percentages of our methods are higher than those of the other methods. This is because our methods predict more correct domains, which may also bring more wrong labels. Compared with the fusion method [19], our method increases CDCL by 19.14% while only increasing CDWL by 9.01%.

3.9. Evaluation of Execution Time and Memory Consumption

Table 6 shows the total execution time and maximum memory consumption of each trained model on the corresponding dataset. We use an NVIDIA Tesla K80 [41] to run the models. In more detail, we use the Tesla K80 to run the model VoVNet-57 on CIFAR-10 and record the total execution time and maximum memory consumption of this model, which is shown in this table. We do the same for VoVNet-57 on the other datasets and, by the same pattern, for the other models on each dataset.

Our methods run multiple models on each dataset, which causes the runtime of our methods to be bigger than that of a single model. CMS-CMM-opt in serial runs the models one by one, which makes the execution time equal to the execution time of a single model × the number of models + the execution time of our fusion process. In more detail, CMS-CMM-opt in serial runs multiple models (one after another) on CIFAR-10 and then runs our fusion method. During these processes, we record the total execution time and maximum memory consumption, which are shown in Table 6.

A simple solution to reduce the execution time is to use fewer models, but this may lower the accuracy. To further reduce the execution time without lowering the accuracy, we run the models on the distributed computational nodes of a cluster based on the parallel pattern of [42]. In more detail, CMS-CMM-opt in parallel utilizes multiple computational nodes, where each node has an NVIDIA Tesla K80 [41]. As the nodes can run the models at the same time, the total execution time is reduced. The total execution time is recorded at the end of all nodes, and the maximum memory consumption is counted as the maximum one among these nodes. As the CMS-CMM-opt in parallel row of Table 6 indicates, the total runtime is reduced compared with CMS-CMM-opt in serial. The additional execution time is caused by the communication and our fusion process. The additional memory consumption is caused by the communication buffers and our fusion process.
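A minimal sketch of the parallel pattern, using a local process pool as a stand-in for the cluster nodes of [42] (the dummy inference function only makes the sketch runnable; in the measured setup each worker owns a Tesla K80):

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def run_model(model_id: int) -> np.ndarray:
    """Stand-in for one node running one trained model on the test data;
    returns a dummy probability vector so the sketch is runnable."""
    rng = np.random.default_rng(model_id)
    p = rng.random(10)
    return p / p.sum()

if __name__ == "__main__":
    # The models run concurrently, so the wall-clock time is roughly one
    # model's runtime plus communication and the fusion step, matching the
    # behaviour of CMS-CMM-opt in parallel in Table 6.
    with ProcessPoolExecutor(max_workers=6) as pool:
        outputs = list(pool.map(run_model, range(6)))
    fused = np.mean(outputs, axis=0)  # fusion runs after all nodes finish
    print(int(np.argmax(fused)))
```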

To achieve higher accuracy with a single model on multiple domains, the structure has to become deeper and more complex, which makes the memory consumption bigger. For example, the V-MoE of Google achieves high accuracy on ImageNet with a trained model of 15 billion parameters [43]. Compared with the super-model solution, our framework is more scalable.

3.10. Illustration of the Domain and Label Classification

Firstly, we use an example to explain the domain classification. As Figure 6 shows, we select the testing samples of label 1, which belongs to the domain CIFAR-10. Then, we run all of the models on these samples and compute the average difference between the outputs of trained models that belong to the same dataset. Figure 6(a) shows the difference of the outputs between the models that are trained on CIFAR-10. Figures 6(b)–6(e) show the differences of the outputs between the models that are trained on CIFAR-100, Mini-ImageNet, EuroSAT, and Intel Image Classification, respectively. As the testing samples belong to CIFAR-10, the difference between the trained models of CIFAR-10 is obviously smaller than those of the other datasets. Thus, the domain classification based on model difference is reasonable.

We present the statistical result of model difference as follows. For a sample $s$ that belongs to the domain $D_j$, we use $d_{same}$ to present the average difference between the trained models of $D_j$ as below:

$$d_{same} = \frac{1}{n-1} \sum_{i=1}^{n-1} d(M_{0,D_j}, M_{i,D_j}, s) \quad (13)$$

where $d$ is defined by (2), $M_{0,D_j}$ is the most accurate model, and the $M_{i,D_j}$ are the other models. For a sample $s$ that belongs to the domain $D_j$, we use $d_{diff}$ to present the difference between the models of $D_k$ ($k \neq j$) as below:

$$d_{diff} = \frac{1}{n-1} \sum_{i=1}^{n-1} d(M_{0,D_k}, M_{i,D_k}, s) \quad (14)$$

where $M_{0,D_k}$ gives the output of the most accurate model on the validation samples of $D_k$ and the $M_{i,D_k}$ give the outputs of the other models on the validation samples of $D_k$. As we can see in Table 7, $d_{same}$ is obviously smaller than $d_{diff}$ on each dataset, which means we can use this value to predict the domain. Based on this analysis, our methods further optimize the prediction of the domain. CIFAR-100 and Mini-ImageNet have bigger numbers of labels than the other datasets, which causes their model difference to be bigger than that of the other datasets.
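The Table 7 statistics can be obtained by averaging the per-sample difference of (13)/(14) over validation samples; a sketch under the same L1 assumption as before:

```python
import numpy as np

def average_difference(outputs_per_sample: list) -> float:
    """Average, over samples, of the mean L1 difference between the most
    accurate model (index 0) and the other models of one set of trained
    models; usable for both the d_same and d_diff columns."""
    diffs = [
        np.mean([float(np.abs(outs[0] - o).sum()) for o in outs[1:]])
        for outs in outputs_per_sample
    ]
    return float(np.mean(diffs))

# Two validation samples, three models each: the models mostly agree, so
# the average difference is small.
sample_a = [np.array([0.8, 0.2]), np.array([0.7, 0.3]), np.array([0.9, 0.1])]
sample_b = [np.array([0.6, 0.4]), np.array([0.5, 0.5]), np.array([0.6, 0.4])]
print(average_difference([sample_a, sample_b]))  # -> 0.15
```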

Secondly, we use three examples to explain the label classification based on the trained models of the corresponding domain. In the first example, we select the testing samples of label 1 from CIFAR-10. Then, we run the trained models of CIFAR-10 on these samples. As Figure 7(a) shows, the average probability of label 1 by each model (trained on CIFAR-10) is obviously higher than those of the other labels. In the Figure 7(b) case, we select the label 9 testing samples from CIFAR-100 and run the trained models of CIFAR-100 on these samples. In the Figure 7(c) case, we select the label 90 testing samples from Mini-ImageNet and run the trained models of Mini-ImageNet on these samples. All of these cases show that the average probability of the ground truth label is obviously higher than those of the other labels when we correctly select the trained models of the corresponding dataset. Thus, the classification based on the probability of labels is reasonable.

We present the statistical analysis of label probability by the models. For a sample $s$ that belongs to the domain $D_j$, we define the average probability $P_{gt}$ of the ground truth label $l_g$ by the trained models of $D_j$ as below:

$$P_{gt} = \frac{1}{n} \sum_{i=0}^{n-1} P_{M_{i,D_j}}(l_g, s) \quad (15)$$

where $P$ is introduced in equation (1). For a sample $s$ that belongs to the domain $D_j$, we define the average value $P_{other}$ of the maximum probability of the other labels by the models as below:

$$P_{other} = \frac{1}{n} \sum_{i=0}^{n-1} \max_{l \neq l_g} P_{M_{i,D_j}}(l, s) \quad (16)$$

As we can see in Table 8, $P_{gt}$ is obviously bigger than $P_{other}$ on each dataset, which means we can use this value to predict the label. Based on this analysis, our methods further optimize the prediction of the label.
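Equations (15) and (16) reduce to a few lines; a minimal sketch (names are ours):

```python
import numpy as np

def label_probability_stats(outputs: list, true_label: int) -> tuple:
    """For one sample: the models' average probability of the ground truth
    label (P_gt, equation (15)) and the average of each model's best other
    label (P_other, equation (16)), mirroring the Table 8 statistics."""
    p_gt = float(np.mean([p[true_label] for p in outputs]))
    p_other = float(np.mean([np.max(np.delete(p, true_label))
                             for p in outputs]))
    return p_gt, p_other

# Example: two models trained on the sample's own domain.
outputs = [np.array([0.7, 0.2, 0.1]), np.array([0.6, 0.3, 0.1])]
print(label_probability_stats(outputs, true_label=0))  # -> (0.65, 0.25)
```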

3.11. Introduction of the Employed Acronyms

We use Table 9 to introduce the acronyms employed in this article for the reader's convenience.

4. Conclusions

In this article, we have introduced a novel framework that achieves scalability of classification by using multiple models. Different from the existing single super-model methods, our framework lowers the consumption of computational resources and achieves good scalability at the same time. Furthermore, we solve the domain-prediction problem of the existing fusion methods. Our framework can be a good solution for applications that have to classify samples of multiple domains.

In future work, we will do research on how to handle the similarity between domains and labels. In some cases, the similarity is caused by similar labels of different domains, like "fox" in CIFAR-100 and "white fox" in Mini-ImageNet. In other cases, it may be caused by similar features between labels of the same domain, which is related to the accuracy of the models. We believe that these factors are the key to increasing the accuracy of classification.

Data Availability

The data used in this study are available at CIFAR-10: https://tensorflow.Google.cn/datasets/catalog/cifar10, CIFAR-100: https://tensorflow.Google.cn/datasets/catalog/cifar100, Mini-ImageNet: https://github.com/topics/miniimagenet, EuroSAT: https://tensorflow.Google.cn/datasets/catalog/eurosat, and Intel image classification: https://www.kaggle.com/datasets/puneet6060/intel-image-classification.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work has been supported by the National Natural Science Foundation of China (Grant nos. 61802279, 6180021345, 61702281, and 61702366), Natural Science Foundation of Tianjin (Grant nos. 18JCQNJC70300, 19JCTPJC49200, 19PTZWHZ00020, and 19JCYBJC15800), the Fundamental Research Funds for the Tianjin Universities (Grant no. 2019KJ019), and The Tianjin Science and Technology Program (Grant no. 19PTZWHZ00020) and in part by the State Key Laboratory of ASIC and System (Grant nos. 2021KF014 and 2021KF015) and Tianjin Educational Commission Scientific Research Program Project (Grant nos. 2020KJ112 and 2018KJ215) and the fund of Beijing Polytechnic (Grant no. 2022X017-KXZ).