Abstract

Automatic biology image classification is essential for biodiversity conservation and ecological study. Recently, due to their record-shattering performance, deep convolutional neural networks (DCNNs) have been used increasingly often in biology image classification. However, training DCNNs requires a large amount of labeled data, which may be difficult to collect for some organisms. This study exploits cross-domain transfer learning for DCNNs with limited data. Previous studies mainly focus on transferring from ImageNet to a specific domain or transferring between two closely related domains, whereas this study explores deep transfer learning between species from different domains and analyzes the situation when there is a large difference between the source domain and the target domain. In this work, a multiple transfer learning scheme is designed to exploit deep transfer learning on several biology image datasets from different domains. A large difference between the source domain and the target domain may cause poor transfer learning performance. To address this problem, multistage transfer learning is proposed by introducing an intermediate domain. The experimental results show the effectiveness of cross-domain transfer learning, demonstrate the importance of data amount, and validate the potential of multistage transfer learning.

1. Introduction

Building accurate knowledge of the identity, taxonomy, geographic distribution, and evolution of living species is essential for the sustainable development of humanity as well as for biodiversity conservation.

In terrestrial ecosystems, plants are extremely complex and diverse, and there are millions of different plant species [1, 2]. Plants must be classified into identifiable groups in order to have a clear, organized way of identifying the diverse array of plants and to support specific applications such as weed control [3, 4].

Besides, the study of marine ecosystems is vital for global climate and environment protection [5–8]. Many kinds of marine organisms are worth studying, such as fish and plankton, which play an important role in the ecosystem [9] and the marine food chain [10].

At the very beginning, species classification was usually based on morphological diagnoses provided by taxonomic studies [11] in a manual identification process. However, for some species like weed plants and plankton, only experts such as taxonomists and trained technicians can identify taxa accurately. Furthermore, one expert may only identify a limited number of species in a specific domain (such as only species of weeds or phytoplankton), because identification requires special skills acquired through extensive experience [3, 12]. At the same time, there is an increasing shortage of skilled taxonomists [13]. The declining and partly nonexistent taxonomic knowledge within the general public has been termed the “taxonomic crisis” [14], posing great challenges to the future of biological study and conservation [11].

Using computer-based multimedia identification tools with computer vision and machine learning techniques has been considered a promising solution for classifying organisms, and a lot of work has been done on this topic [15, 16].

1.1. Traditional Image Classification

The traditional image classification process can generally be divided into three steps: image preprocessing, feature extraction/description, and classification [17]. Some preprocessing techniques are often used in an image classification system to produce a suitably enhanced image for the subsequent feature extraction step, such as image denoising, image enhancement, and image segmentation [18]. Feature extraction refers to taking measurements, geometric or otherwise, of possibly segmented, meaningful regions in the image [19]. To characterize and describe properties of an organism image by a set of values, computer vision experts have handcrafted many features. In previous studies, some general features like size [20], color, shape context [21–24], invariant moments, granulometric features, co-occurrence matrices, Fourier descriptors, Gabor filters, local binary patterns (LBP) [25], histograms of oriented gradients (HOG), and scale-invariant feature transform (SIFT) have been commonly used. There are also some features that have been designed for specific species [26–28].

In the classification step, all extracted features are concatenated into a feature vector and then fed into the subsequent classifiers. Several kinds of traditional classifiers have been employed in previous studies, including k-nearest neighbor (kNN) [22], decision tree (DT) [22], random forest (RF) [29, 30], neural network (NN) [22, 23], support vector machines (SVM), and ensemble learning methods [12, 22, 31, 32].

However, handcrafted features usually lack robustness and cannot represent the complex biomorphic characteristics of some organisms [12]. Besides, some features are elaborately handcrafted for specific organisms [33] and often perform poorly when extended to other organisms. Traditional classifiers usually do not achieve high prediction accuracy across different datasets [12]. Especially when the datasets are large or contain more than 20 categories, these classifiers may be limited by the “curse of dimensionality” [34], so they are hard to apply directly in ecological studies.

1.2. Deep Convolutional Neural Networks

In recent years, DCNNs [35–42] have become a mainstay of the computer vision community due to their record-shattering performance in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [43]. ImageNet is a large-scale image dataset with 1000 classes, containing 1.3 million training images, 50,000 validation images, and 100,000 testing images. DCNNs consist of a stack of learned convolution filters that extract hierarchical contextual image features and are thus high-capacity classifiers. With this high capacity, DCNNs can find the relevant contextual image features in classification problems and are less likely to be restricted by the “curse of dimensionality.” Moreover, unlike traditional methods, DCNNs do not need to divide the training process into several steps but use an end-to-end learning mechanism, which is more suitable for real applications. The outstanding performance of DCNNs in image classification and other problems has received unprecedented attention, prompting scholars to apply them to various practical problems including biology image classification [3, 44–48]. Nevertheless, the very large number of parameters in DCNNs requires large-scale annotated training data. For some organisms inhabiting complex environments, such as some marine and even microscopic organisms, it is very difficult to collect images. Moreover, the collected data can only be used after being precisely classified by experienced experts. Since experienced experts are often scarce and one expert can only identify a limited number of species in a specific domain (such as only species of weeds or phytoplankton) [12], the data available in practical studies may be insufficient to fully exploit the potential of DCNNs.

1.3. Transfer Learning with DCNNs

Transfer learning aims to transfer knowledge between a source domain and a target domain [49]. In biology image classification and some other scenarios, obtaining training data might be difficult and expensive. However, transfer learning can overcome the deficit of training examples in some domains by adapting classifiers trained on another domain [50]. There are two ways to apply transfer learning with DCNNs. One is treating the DCNN as a big feature extractor: the pretrained network with its learned weights is used to extract features that are subsequently used in a new domain, with the outputs of the DCNN considered high-level features and fed into a following classifier. The other is to fine-tune the network weights by training the network with data from the new domain; in this case, the dimension of the output layer must be changed to match the number of classes in the new domain dataset. There are some studies on biology image classification using transfer learning. Ge et al. [51] learned a domain-generic DCNN for the task of plant classification by applying transfer learning to the parameters of the GoogLeNet [37] model (pretrained on the large-scale ImageNet dataset) using all of the training data for the plant classification task. Lee et al. [52] incorporated transfer learning by pretraining a DCNN with class-normalized data and fine-tuning with the original data.
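As a minimal illustration of these two modes, the sketch below uses torchvision's ResNet-18 as a stand-in for any pretrained DCNN; the class count and input size are placeholders, not values from this paper's experiments.

```python
import torch
import torch.nn as nn
from torchvision import models

num_target_classes = 17  # placeholder, e.g., Flowers17

# Mode 1: treat the pretrained DCNN as a fixed feature extractor.
extractor = models.resnet18(pretrained=True)
extractor.fc = nn.Identity()            # expose the 512-d penultimate features
extractor.eval()
for p in extractor.parameters():
    p.requires_grad = False             # pretrained weights stay frozen
with torch.no_grad():
    features = extractor(torch.randn(4, 3, 224, 224))  # (4, 512); fed to a separate classifier

# Mode 2: fine-tune the network weights on the new domain.
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, num_target_classes)  # output layer matches new class count
# training now updates all weights, old and new, with target-domain data
```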

Orenstein and Beijbom [53] built on the insights from Kaggle’s National Data Science Bowl (NDSB) and investigated how DCNNs perform on several datasets of in situ plankton images, and their study suggests that weights from a highly tuned network for one planktonic image set could be used effectively in another plankton domain. Ge and Yu [54] introduced a source-target selective joint fine-tuning scheme for improving the performance of deep learning tasks with insufficient training data. Their idea is to identify and use a subset of training images from the original source learning task whose low-level characteristics are similar to those from the target learning task and jointly fine-tune shared convolutional layers for both tasks.

Previous studies on transfer learning with DCNNs mainly focused on tasks that transfer from ImageNet to a specific domain or transfer between two closely related domains [53]. Only a few studies have exploited transfer learning between two domains that are not directly related. When applying transfer learning to biology image classification, different distances between the species in the source domain and the target domain may have different effects on the performance. Although there is a certain biological distance between the two domains, they may share some common patterns from the perspective of DCNNs.

In this paper, inspired by the analysis of the literature and practical applications, deep transfer learning for biology cross-domain image classification is explored. By analyzing the experimental results on image datasets in different biology domains, including flowers, plant seedlings, plankton, and fish, some interesting conclusions are drawn. The main contributions of this paper are as follows:

(1) A multiple transfer learning scheme is designed to explore deep transfer learning for biology cross-domain image classification. Following this scheme, deep transfer learning is applied among datasets in multiple domains.

(2) Even when there is no clear relationship between the species in the source domain and the target domain, deep transfer learning can still be applied to improve DCNN performance. The features learned by DCNNs in one domain are high-level and can be transferred to another domain at a large distance.

(3) Fine-tuning a model pretrained on ImageNet usually gives a better result than training from scratch. However, for some datasets that differ greatly from ImageNet, using ImageNet as the source domain dataset may not improve the results. Based on multiple transfer learning, multistage transfer learning is proposed to address this issue: in the first stage, the DCNNs are pre-pretrained on ImageNet; in the second stage, the DCNNs are pretrained on an intermediate domain, aiming to adapt the features to the target domain; finally, the DCNNs are fine-tuned on the target domain to obtain a possibly better result.

The experimental results show the potential of cross-domain transfer learning and may provide some ideas for other people who use transfer learning to study biology image classification or other related issues.

2. Methods

To exploit deep transfer learning for biology cross-domain image classification, a multiple transfer learning scheme is designed and multistage transfer learning is proposed to train DCNNs with several datasets from different domains.

2.1. Deep Convolutional Neural Networks (DCNNs)

Several DCNNs were trained on datasets from different domains, including AlexNet [35], VGG-16 [36, 55], GoogLeNet v3 [39], and ResNet [40, 56, 57] with 18, 34, 50, 101, and 152 layers. Table 1 shows their depths, parameter numbers, and performances on the ImageNet dataset.

AlexNet [35] consists of five convolutional layers and three fully connected layers. There are three max-pooling layers after layers 1, 2, and 5. In the first layer, the 3 channels in the filters correspond to the red, green, and blue components of the input image. The local response normalization (LRN) [35] introduced in AlexNet was dropped in our implementation, as it was no longer used in subsequent DCNNs after being replaced with batch normalization [38].

VGG-16 [36] consists of 13 convolutional layers and 3 fully connected layers. In order to increase the depth of the network, small 3×3 convolution filters are used in all convolutional layers.

GoogLeNet [37] has 22 layers, consisting of three convolutional layers, nine inception layers (each of which is two convolutional layers deep), and one fully connected layer. The inception layer is composed of parallel connections with different sized filters, including 1×1, 3×3, and 5×5, along with max-pooling, used for each parallel connection. The outputs of each connection in the inception module are concatenated together as the inception output. Using multiple filter sizes has the effect of processing the input at multiple scales. In order to reduce the number of weights, 1×1 filters are applied as a “bottleneck” to reduce the number of channels for each filter. GoogLeNet has multiple versions; batch normalization was introduced in the second version, and the most popular version, known as GoogLeNet v3, is used in this paper. GoogLeNet v3 decomposes the convolutions by using smaller 1-D filters (e.g., replacing an n×n convolution with a 1×n convolution followed by an n×1 convolution) to reduce the number of weights and go deeper.

As the error back-propagates through the network, the gradient shrinks, which affects the ability to update the parameters in the earlier layers of very deep networks. To deal with this vanishing gradient problem, ResNet uses residual connections. ResNet introduces a “shortcut” module which contains an identity connection so that the “weight” layers (the layers that contain parameters) can be skipped. Rather than learning the underlying function directly, the weight layers in the shortcut module learn the residual mapping. The “bottleneck” approach used in GoogLeNet, which uses 1×1 convolutions to reduce the number of weight parameters, is also used in ResNet. ResNet can be implemented with different numbers of layers; in this paper, ResNet with 18, 34, 50, 101, and 152 layers is built.
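All of the architectures above are available in torchvision; the sketch below instantiates them and counts their parameters, as reported alongside depth in Table 1. The exact torchvision variants are assumptions (e.g., whether the batch-norm version of VGG-16 was used).

```python
from torchvision import models

# Instantiate the eight architectures compared in Table 1.
dcnns = {
    "AlexNet": models.alexnet(),
    "VGG-16": models.vgg16_bn(),                      # batch-norm variant; an assumption
    "GoogLeNet v3": models.inception_v3(aux_logits=False),
    "ResNet-18": models.resnet18(),
    "ResNet-34": models.resnet34(),
    "ResNet-50": models.resnet50(),
    "ResNet-101": models.resnet101(),
    "ResNet-152": models.resnet152(),
}

# Count trainable parameters for each model.
for name, net in dcnns.items():
    print(f"{name}: {sum(p.numel() for p in net.parameters()) / 1e6:.1f}M parameters")
```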

2.2. Rectified Linear Unit

The Rectified Linear Unit (ReLU) activation function is applied to the output of every convolutional layer in all DCNNs used in this paper. The ReLU activation function can be described by the following equation:

$$f(x) = \max(0, x),$$

where $x$ indicates the input of the ReLU activation function. The ReLU activation function can make DCNNs more sparse: in a randomly initialized network, only about 50% of hidden units are activated (have nonzero output) simultaneously. Another benefit of ReLU is that it reduces the likelihood of the vanishing gradient; when $x > 0$, the gradient has a constant value, which results in faster learning of the DCNNs.

2.3. Dropout

Dropout is a technique to reduce overfitting, which sets the output of each hidden neuron to zero with a certain probability. The neurons that are “dropped out” in this way do not contribute to the forward pass and do not participate in back-propagation during training. Every time an input is presented, the neural network samples a different architecture, but all these architectures share weights. Dropout is employed in the fully connected layers of AlexNet and VGG.

2.4. Batch Normalization

Batch normalization [38] speeds up the training process and improves accuracy by controlling the input distribution across layers. To this end, the distribution of the layer input activations is normalized to zero mean and unit standard deviation, which can be described as

$$y = \gamma \, \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta,$$

where $\mu$ and $\sigma$ indicate the mean and standard deviation of the distribution of layer input activations, $\gamma$ and $\beta$ are parameters that can be learned during training, and $\epsilon$ is a small constant to avoid numerical problems.

2.5. Softmax

The softmax function is employed after the output layer, which is a fully connected layer with $C$ units. Here $C$ indicates the number of classes in the image classification task and has the same meaning in equation (3). The output of the softmax can represent a probability distribution over all the predicted classes, which is computed by

$$p_i = \frac{e^{z_i}}{\sum_{j=0}^{C-1} e^{z_j}},$$

where $z_i$ represents the output of the $i$-th unit in the last fully connected layer and $i$ ranges from 0 to $C-1$.
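A small sketch of equation (3) follows; subtracting the maximum logit before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(z):
    """Map the C outputs of the last fully connected layer to probabilities."""
    z = z - np.max(z)          # stability shift; cancels out in the ratio
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p)          # approximately [0.659, 0.242, 0.099]
print(p.sum())    # 1.0: a valid distribution over the C classes
```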

2.6. Data Augmentation

Artificially enlarging the dataset using label-preserving transformations [35, 39] is the easiest way to reduce overfitting on image data. There are three forms of data augmentation in our classification system: feature normalization, image resizing/cropping, and image horizontal flipping. It has been shown that feature normalization can make gradient descent converge faster [38]. During both the training phase and the test phase, when image data are fed into the system, the system performs feature normalization for each channel of the image:

$$\hat{x}_c = \frac{x_c - \mu_c}{\sigma_c},$$

where $x_c$ indicates the $c$-th channel of the input image; $\mu_c$ and $\sigma_c$ indicate the mean and standard deviation of the $c$-th channel among all the images in the training set, respectively; and $\hat{x}_c$ indicates the $c$-th channel of the normalized input image.

2.7. Pipeline and Experiment Details

All the DCNNs in this paper are implemented with the PyTorch deep learning framework. For the GoogLeNet v3 network, the input image is first resized and then cropped to 299×299; for all other networks, the input image is resized and then cropped to 224×224. To prevent substantial overfitting [35], different cropping methods are employed during the training phase and the test phase. During the training phase, random cropping is employed by extracting random 224×224 patches (299×299 for the GoogLeNet v3 network) from the resized images. These patches are then randomly horizontally flipped and fed into the network for training. During the test phase, each image in the test set needs to be predicted only once, and the foreground organisms are more likely to appear in the center of the image, so only center cropping is employed.
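A sketch of this pipeline with torchvision transforms for the 224×224 networks follows (for GoogLeNet v3, the crop size would be 299×299). The resize size and per-channel statistics below are placeholders, not values from the paper.

```python
from torchvision import transforms

# Per-channel mean/std of the training set; placeholder values here.
mean, std = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]

train_transform = transforms.Compose([
    transforms.Resize(256),                # assumed resize size before cropping
    transforms.RandomCrop(224),            # random patches during training
    transforms.RandomHorizontalFlip(),     # random horizontal flipping
    transforms.ToTensor(),
    transforms.Normalize(mean, std),       # per-channel feature normalization
])

test_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),            # foreground organisms tend to be centered
    transforms.ToTensor(),
    transforms.Normalize(mean, std),
])
```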

All the first convolutional layers of the DCNNs in this paper have three channels, corresponding to the three channels of an RGB image. All inputs to the DCNNs are fixed-size images: 299×299 for GoogLeNet v3 and 224×224 for the other networks. If a single-channel gray image is input to a DCNN, it is converted to an RGB image with three identical channels, whose values are copied from the single-channel image.

To get more details of our experiments, please visit our open-sourced repository BioTL [58] on GitHub.

2.8. Training from Scratch

The DCNN training procedure generally follows Krizhevsky et al. [35]. The initialization of the network weights is important because bad initialization can stall learning due to the instability of gradients in DCNNs [36]. The biases are initialized with zero, and the weights in all the convolutional layers are initialized from a zero-mean Gaussian distribution with standard deviation $\sqrt{2/n}$, where $n$ is the product of the size and the number of channels of the filters in the layer. Weight decay and a minibatch size of 16 are used. AlexNet and VGG-16 are initialized with a smaller learning rate than all other DCNNs. With their initial learning rates, all DCNNs are trained for up to 300 epochs, during which the learning rate is divided by 10 every 100 epochs.
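A minimal sketch of this schedule with PyTorch's SGD and a step learning-rate scheduler follows; the initial learning rate, momentum, and weight decay values are assumptions, since the exact values are not recoverable here.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import models

model = models.resnet18(num_classes=17)                  # e.g., Flowers17; trained from scratch
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,   # assumed initial learning rate
                            momentum=0.9, weight_decay=5e-4)  # assumed values
# Divide the learning rate by 10 every 100 epochs, up to 300 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)

def train_from_scratch(train_set, epochs=300):
    loader = DataLoader(train_set, batch_size=16, shuffle=True)  # minibatch size 16
    model.train()
    for epoch in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
```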

2.9. Cross-Domain Transfer Learning
2.9.1. Fine-Tuning on ImageNet

To fully utilize the potential of DCNNs with small amounts of data, we use ImageNet as the source domain and apply transfer learning to transfer the knowledge learned from ImageNet to the target domain. The data augmentation operations are the same as for training from scratch. Instead of initializing all the weights randomly, they are initialized (except for the last fully connected layer) with the weights learned on the ImageNet dataset. Because the number of classes in the target task may differ from ImageNet's 1000 classes, which corresponds to the output dimension of the last fully connected layer, the weights of the last fully connected layer in the pretrained model are dropped.

2.9.2. Multiple Transfer Learning

To exploit deep transfer learning for biology image classification, a multiple transfer learning scheme is designed.

The multiple transfer learning scheme applies transfer learning several times across multiple source domains to observe the cross-domain effect. For example, first, a DCNN model is trained on the Flowers17 dataset, which is considered the source domain. Second, all the weights of the trained model except the last fully connected layer are used to initialize a new model with the same architecture. This is because the output dimension of the last fully connected layer corresponds to the number of classes in the classification task, and the number of classes in the source domain often differs from that in the target domain, so the last fully connected layer needs to be rebuilt to fit the new task. Finally, the new model with the initialized weights is trained on the target domain dataset, such as QUT Fish, as sketched below.
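A minimal sketch of this weight-transfer step for same-architecture models, assuming torchvision's ResNet naming (where the output layer is called `fc`); class counts follow Section 3.

```python
from torchvision import models

def transfer(source_model, num_target_classes):
    """Initialize a target-domain model with all source weights except the last FC layer."""
    target_model = models.resnet18(num_classes=num_target_classes)
    source_state = {k: v for k, v in source_model.state_dict().items()
                    if not k.startswith("fc.")}        # drop the old output layer
    target_model.load_state_dict(source_state, strict=False)  # "fc" stays freshly initialized
    return target_model

source_model = models.resnet18(num_classes=17)  # assume already trained on Flowers17
target_model = transfer(source_model, 482)      # to be fine-tuned on QUT Fish (482 classes)
```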

In practice, the ImageNet dataset is first used as the source domain, and the pretrained models are fine-tuned on the five target domain datasets (Flowers17, Flowers102, Plant Seedlings, PlanktonSet 1.0, and QUT Fish). After that, to explore the effects of different distances between the species in the source domain and the target domain, different combinations of the five datasets are chosen for transfer learning.

2.9.3. Multistage Transfer Learning

There may be a huge difference between the source domain dataset and the target domain dataset, so that the knowledge learned from the source domain cannot be transferred well. If the data in an intermediate domain can adapt the learned features to fit the target domain, the hindering effect will not be particularly noticeable, or the performance may even be improved. To make the knowledge learned from the source domain more transferable, multistage transfer learning is proposed. To perform multistage transfer learning, an intermediate domain is added between the source domain and the target domain.

Figure 1 uses a diagram to demonstrate the multistage transfer learning framework. In Figure 1, the “CONV 1” to “CONV N” blocks indicate the N convolutional layers in the DCNN model, and the “FC” block indicates the fully connected layer. As shown in Figure 1, the proposed multistage transfer learning consists of three stages: pre-pretrain the models on ImageNet, which is considered the source domain; pretrain the models on an intermediate domain; and fine-tune the models on the target domain. Since it is unknown how to find the best intermediate domain dataset, the multiple transfer learning scheme is followed with a grid search to try different datasets as the intermediate domain. Considering the computational cost, multistage transfer learning is explored only on three models, ResNet-18, ResNet-34, and ResNet-50, which have similar structures but different depths.
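Reusing the `transfer()` helper sketched in Section 2.9.2, the three stages chain together as below; `fit()` is a hypothetical stand-in for an ordinary fine-tuning loop such as the one in Section 2.8.

```python
from torchvision import models

def multistage_transfer(intermediate_set, target_set,
                        num_intermediate_classes, num_target_classes):
    # Stage 1: start from a model pre-pretrained on ImageNet (the source domain).
    model = models.resnet18(pretrained=True)
    # Stage 2: pretrain on the intermediate domain to adapt the features.
    model = transfer(model, num_intermediate_classes)
    fit(model, intermediate_set)       # hypothetical fine-tuning helper
    # Stage 3: fine-tune on the target domain.
    model = transfer(model, num_target_classes)
    fit(model, target_set)
    return model

# e.g., ImageNet -> Flowers102 (102 classes) -> PlanktonSet 1.0 (121 classes)
```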

3. Datasets

In this paper, to exploit cross-domain transfer learning, several datasets from different domains are chosen, including Oxford Flowers, Plant Seedlings, PlanktonSet 1.0, and QUT Fish.

3.1. Oxford Flowers

There are two versions of the Oxford Flowers dataset: Oxford Flowers 17 (Flowers17) and Oxford Flowers 102 (Flowers102). Flowers17 contains 17 classes of flowers, with 80 images in each class, chosen so that the classes cannot be distinguished by color alone. The Flowers102 dataset consists of 102 classes represented by 40 to 258 images per class, with 8189 images in total. About 45% of the Flowers17 images are also part of Flowers102, so Flowers17 is not simply a subset of Flowers102. Image examples from these two datasets are shown in Figures 2 and 3, in which images in the same row come from the same class and images in different rows come from different classes. Following the recommendation in the official dataset documents, each dataset is split into a training set, a validation set, and a test set.

3.2. Plant Seedlings

The Plant Seedlings [4] dataset contains images of approximately 960 unique plants belonging to 12 species at several growth stages. There are three versions of the Plant Seedlings dataset: original raw images, automatically segmented plants, and single plants that are not segmented. The version with nonsegmented single plants, which contains 4750 images from 12 species, was used in the Kaggle Plant Seedlings Classification competition in 2017. In this paper, the nonsegmented single-plant version is used with 5-fold cross-validation. Figure 4 shows some image examples from the Plant Seedlings dataset, in which the images in the same rectangle belong to the same class.

3.3. PlanktonSet 1.0

PlanktonSet 1.0 [59] is a medium-sized dataset with a fair amount of complexity, which was used in the National Data Science Bowl hosted by Kaggle in 2015. The image segments extracted from the raw data contain 60,736 images in total, sorted into 121 plankton classes and split into a training set and a test set with a ratio of 1:1. The images obtained by the camera were already processed by a segmentation algorithm to isolate individual organisms and then cropped accordingly, as can be seen in Figure 5. The image samples demonstrate that there is high intraclass variance and small interclass variance among some plankton species.

3.4. QUT Fish

QUT Fish [60] consists of 3960 images collected from 482 fish species. The data contain real-world images of fish captured in conditions defined as “controlled,” “in situ,” and “out-of-the-water,” shown in Figure 6. Since “controlled” images are captured with a controlled background and high quality, when splitting the dataset, “controlled” images tend to be placed in the training set, while the low-quality “in situ” and “out-of-the-water” images with pose variations tend to be placed in the test set. The QUT Fish dataset is split into a training set and a test set with a ratio of 1:1. Because some classes in this dataset contain only two image examples, only 2-fold cross-validation can be applied.

Because the amount of training data plays a crucial role in training DCNNs, the total number of training examples and the average number of training samples per class for the above datasets are listed in Table 2. From Table 2, the scales of all the datasets are small compared to ImageNet, which contains more than one million training images. For QUT Fish, the data are extremely scarce, since on average there are only 4 training samples per class. From Table 1, it is obvious that as the number of layers (depth) increases, the performance of the DCNN on ImageNet gets better and better; at the same time, the number of parameters also increases with the number of layers.

4. Evaluation

In this paper, accuracy and $F_1$ score are used as the evaluation metrics.

Accuracy is the most intuitive and frequently used performance measure for classification tasks. Accuracy is simply the ratio of correctly predicted samples to the total number of samples, so it can be easily calculated. Accuracy is a good measure if the datasets are balanced; however, for imbalanced datasets, accuracy may not reflect the real performance of the classifier. Most of the datasets used in this paper are imbalanced, such as Flowers102, Plant Seedlings, PlanktonSet 1.0, and QUT Fish. The distributions of these datasets can be seen in Figure 7. To evaluate the classification performance on imbalanced datasets, the $F_1$ score is used as another metric.

Both accuracy and $F_1$ can be calculated from the confusion matrix, which is a table containing information about actual and predicted classifications. As shown in Table 3 (refer to Table 1 in Ref. [12]), each row of the confusion matrix represents the instances in a predicted class, while each column represents the instances in an actual class. For a binary classifier, according to the true condition and predicted condition, the confusion matrix consists of four parts: true positives ($TP$), true negatives ($TN$), false positives ($FP$), and false negatives ($FN$). In this way, several measures can be derived from the confusion matrix:

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \text{precision} = \frac{TP}{TP + FP}, \quad \text{recall} = \frac{TP}{TP + FN}.$$

$F_1$ is the harmonic mean of precision and recall:

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$

Therefore, $F_1$ takes both precision and recall into account and is more useful than accuracy when the class distribution is uneven.
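As a small worked example, these metrics can be computed from predictions with scikit-learn; macro averaging over classes is assumed here, since the text does not state how the per-class $F_1$ values are aggregated.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Toy 3-class predictions for illustration.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

print(confusion_matrix(y_true, y_pred))              # rows = actual, columns = predicted (sklearn's convention)
print("accuracy:", accuracy_score(y_true, y_pred))   # 4 of 6 correct
print("F1 (macro):", f1_score(y_true, y_pred, average="macro"))
```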

5. Results

The multiple transfer learning scheme is designed to exploit deep transfer learning on the Flowers17, Flowers102, Plant Seedlings, PlanktonSet 1.0, and QUT Fish datasets. For comparison, DCNN models are also pretrained on ImageNet and then fine-tuned on the five datasets. For multiple transfer learning, one of the five datasets is chosen as the source domain and another as the target domain. In the tables of experimental results, “source domain→target domain” is used to denote the transfer process from the source domain to the target domain. To handle the problem of extremely insufficient data, multistage transfer learning is proposed, which introduces an intermediate domain between the source domain and the target domain; in the tables of experimental results, “source domain→intermediate domain→target domain” is used to denote this multistage transfer process.

5.1. Training from Scratch

The classification results of all DCNNs trained from scratch on the five datasets are listed in Tables 4 and 5. From the results, when training from scratch, ResNet-18 achieved the best performance on Flowers17, with 90.29% accuracy and 0.9031 $F_1$. ResNet-18 also achieved the best performance on Plant Seedlings, with 98.02% accuracy and 0.9778 $F_1$. The best performance on Flowers102 was achieved by ResNet-34, with 57.16% accuracy and 0.5513 $F_1$. Similarly, the best performance on QUT Fish was also achieved by ResNet-34, with 36.51% accuracy and 0.2811 $F_1$. There are relatively more data in PlanktonSet 1.0, so the DCNNs that achieve better results on PlanktonSet 1.0 tend to be deeper: the best accuracy of 77.40% was achieved by ResNet-152, and the best $F_1$ of 0.6593 by ResNet-101.

5.2. Cross-Domain Transfer Learning

For comparison, experiments of fine-tuning the models pretrained on ImageNet are performed. The multiple transfer learning scheme applies transfer learning across several cross-domain datasets. Similar to fine-tuning on ImageNet, the weights of the model pretrained on the source domain are used to initialize a new DCNN, which is then fine-tuned on the target domain dataset. To adapt the features of the models pretrained on ImageNet to fit the target domain well, multistage transfer learning is proposed by adding an intermediate domain between the source domain and the target domain. The multiple transfer learning results are shown in Tables 6 and 7, where the best source domain and intermediate domain datasets under the same transfer learning conditions are highlighted. For a sharp contrast, Table 8 lists how much gain the cross-domain transfer learning methods obtain compared with training from scratch.

In the “Transfer process” column of these tables, the entries in boldface indicate the source domain dataset or intermediate domain dataset with the best classification results; in the “Accuracy (%)” column, the entries in boldface indicate the best performance with the highest accuracy; in the “$F_1$” column, the entries in boldface indicate the best performance with the highest $F_1$.

5.3. Fine-Tuning on ImageNet

The results of fine-tuning on ImageNet are shown in Tables 9 and 10. Compared with the training-from-scratch results in Tables 4 and 5, for the Flowers17, Flowers102, and QUT Fish datasets, every single model achieves better performance after fine-tuning on ImageNet. For Flowers17, fine-tuning on ImageNet gains 9.52% accuracy and 0.0911 $F_1$ on average across all models; for Flowers102, the result is much better, with gains of 35.65% accuracy and 0.3695 $F_1$ on average; for QUT Fish, there is also a better result, with gains of 17.96% accuracy and 0.1746 $F_1$ on average. For Plant Seedlings, there is only an improvement of 0.48% accuracy and 0.0054 $F_1$ on average, possibly because the original performance is already good enough. For PlanktonSet 1.0, some of the results improve while some decline, with a 0.02% accuracy decrease and a 0.0002 $F_1$ gain on average. These results reflect the huge difference between PlanktonSet 1.0 and ImageNet.

5.4. Multiple Transfer Learning

Tables 6, 11, and 7 show that all the multiple transfer learning experiments achieve better results than training from scratch on Flowers17, Flowers102, and QUT Fish. For these three datasets, using PlanktonSet 1.0 as the source domain yields better results than using the other datasets. Specifically, on average, for Flowers17, using PlanktonSet 1.0 as the source domain gains 4.34% accuracy and 0.0423 $F_1$ over training from scratch; for Flowers102, the gains are 14.84% accuracy and 0.1489 $F_1$; for QUT Fish, the gains are 11.92% accuracy and 0.1058 $F_1$. For the Flowers17 dataset, using Flowers102 as the source domain dataset gains only 0.66% accuracy and 0.0070 $F_1$ on average, which is much poorer than using PlanktonSet 1.0 as the source domain dataset; for Flowers102, using Flowers17 as the source domain gains 8.44% accuracy and 0.0871 $F_1$ on average, which is also poorer than using PlanktonSet 1.0 as the source domain dataset.

For Plant Seedlings, Table 12 shows that using Flowers17 as the source domain dataset gains 0.21% accuracy and 0.0017 $F_1$ on average, while using Flowers102, PlanktonSet 1.0, or QUT Fish as the source domain dataset all decrease the results.

For PlanktonSet 1.0, Table 13 shows no clear evidence that multiple transfer learning improves the results consistently. On average, using Flowers17 as the source domain dataset decreases accuracy by 0.06% and $F_1$ by 0.0006; using Flowers102 decreases accuracy by 0.09% and $F_1$ by 0.0023; using Plant Seedlings gains 0.08% accuracy and 0.0002 $F_1$. In fact, even using ImageNet as the source domain dataset decreases the results by 0.17% accuracy and 0.0005 $F_1$ on average.

5.5. Multistage Transfer Learning

Table 8 lists the gains of cross-domain transfer learning compared with training from scratch. For most of the results, fine-tuning on ImageNet gets the best results. But for some datasets which differ greatly from ImageNet, like PlanktonSet 1.0, fine-tuning on ImageNet may hinder the performance of the models.

In multistage transfer learning, after fine-tuning the model on ImageNet, the model is trained on an intermediate domain before being fine-tuned on the target domain. From Table 8, selecting different intermediate domains has different effects on the final results. For Flowers17, Flowers102, Plant Seedlings, and QUT Fish, the results of multistage transfer learning do not outperform fine-tuning on ImageNet. For PlanktonSet 1.0, selecting Flowers102 as the intermediate domain gives the best performance, with a gain of 0.0036 $F_1$ on average.

6. Discussion

In this paper, the multiple transfer learning scheme and the multistage transfer learning method are introduced to exploit cross-domain transfer learning for biology image classification. Our aim is to address the problem that limited labeled data may not fully utilize the feature representation power of DCNNs. To achieve this, the multiple transfer learning scheme is designed to explore cross-domain transfer learning, and multistage transfer learning is proposed to learn high-level patterns from different domains so that the learned features fit the target domain.

Table 1 shows that, as the depth of a DCNN increases, its performance on ImageNet gets better and better. Meanwhile, the number of parameters of the network also increases dramatically, which makes training the network more difficult, especially when the amount of data is scarce. In order to compare the performances of different models on different datasets and observe their trends intuitively, the performances of the different models in Figure 8 are normalized and translated. The depth, the number of parameters, and the performance on ImageNet for each model are also added to Figure 8, normalized and translated in the same way.

In Table 2, it can be seen that the scales of the datasets in this paper are very small compared to ImageNet. It can also be seen that after the depth of the network reaches a certain level, its performance no longer improves as the depth increases: most of the best results on these datasets are achieved with ResNet-18 or ResNet-34.

DCNNs can learn some high-level patterns that are general, so transfer learning can be used to transfer these learned high-level patterns to a target domain with limited data. When the amount of data in the target domain is small, the amount of data in the source domain plays an important role in the transfer learning performance. For example, there are more data in PlanktonSet 1.0, so when PlanktonSet 1.0 is used as the source domain dataset, the multiple transfer learning results tend to be better (Tables 6, 11, and 7). In Table 11, although there is a closer biological distance between Flowers17 and Flowers102, the performance of using Flowers17 as the source domain dataset is worse than that of using PlanktonSet 1.0. When the amount of data in the target domain is large, the effect of different biological distances between the species in the source domain and the target domain is reflected (Table 12): although PlanktonSet 1.0 contains more data than all the other datasets, using PlanktonSet 1.0 as the source domain dataset did not give the best result.

Multistage transfer learning is proposed to address the problem caused by the big gap between the source domain and the target domain. From Table 8, it can be seen that, since there is a huge difference between ImageNet and PlanktonSet 1.0, multistage transfer learning with cross-domain datasets can improve on the performance of fine-tuning on ImageNet. But when performing multistage transfer learning, a dataset that can adapt the learned features to the target domain needs to be selected as the intermediate domain; otherwise, the performance may be hindered by the big difference between the source domain dataset and the target domain dataset.

7. Conclusions

In this paper, the multiple transfer learning scheme is designed to exploit deep transfer learning for biology cross-domain image classification. By pretraining the DCNN model on different source domains, the results on the target domain dataset can be improved significantly. The experimental results prove that even out-of-domain data are effective when the target domain data are insufficient. The multistage transfer learning method is also proposed, which can improve the performance of DCNNs when there is a huge difference between the source domain and the target domain. A limitation of multistage transfer learning is that the dataset in the intermediate domain should be carefully selected; otherwise, the final performance may be hindered. However, it is difficult to find the best way to search for the optimal intermediate domain dataset, and this needs further study. In our view, searching for datasets that have similar low-level characteristics to the target domain may be a good choice. Since DCNNs can learn some high-level domain-independent features, the ideas of multiple transfer learning and multistage transfer learning can be widely applied to biology image classification and other fields.

Data Availability

The authors provide links to all datasets here: (1) Oxford Flowers (https://www.robots.ox.ac.uk/~vgg/data/flowers/), (2) Plant Seedlings (https://vision.eng.au.dk/plant-seedlings-dataset/), (3) PlanktonSet 1.0 (https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.nodc:0127422), and (4) QUT Fish (https://www.kaggle.com/sripaadsrinivasan/fish-species-image-data).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (61771440 and 41776113) and Qingdao Municipal Science and Technology Program (17-1-1-5-jch).