Abstract

In this paper, we propose a novel multitask learning method based on a deep convolutional network. The proposed deep network has four convolutional layers, three max-pooling layers, and two parallel fully connected layers. To adapt the deep network to the multitask learning problem, we propose to learn a low-rank deep network so that the relations among different tasks can be explored. We propose to minimize the number of independent parameter rows of one fully connected layer to explore the relations among different tasks; this number is measured by the nuclear norm of that layer's parameter matrix, and we thus seek a low-rank parameter matrix. Meanwhile, we also propose to regularize the other fully connected layer by a sparsity penalty so that the useful features learned by the lower layers can be selected. The learning problem is solved by an iterative algorithm based on the gradient descent and back-propagation algorithms. The proposed algorithm is evaluated over benchmark datasets of multiple face attribute prediction, multitask natural language processing, and joint economics index prediction. The evaluation results show the advantage of the low-rank deep CNN model over multitask problems.

1. Introduction

1.1. Backgrounds

In machine learning applications, multitask learning has been a popular topic [1–9]. It tries to solve multiple related machine learning problems simultaneously. The motivation is that, in many situations, multiple tasks are closely related, and the prediction results of the different tasks should be consistent. Accordingly, it is natural to borrow the predictions of other tasks to help the prediction of a given task. For example, in the face attribute prediction problem, given an image, the predictions of female gender and of wearing long hair are usually related [10–14]. Moreover, in natural language processing, it is also natural to couple the problems of part-of-speech (POS) tagging and noun chunk prediction, since a word with a POS of a noun usually appears in a noun chunk [15–19]. Multitask learning aims to build a joint model for multiple tasks from the same input data.

In recent years, deep learning has been proven to be one of the most powerful data representation methods [20–32]. Deep learning methods learn a neural network of multiple layers to extract hierarchical patterns from the original data and provide high-level, abstract features for the learning problems. For example, for face recognition problems, a deep learning model learns simple patterns in the low-level layers, such as lines, edges, circles, and squares. In the middle-level layers, parts of faces are learned, such as eyes, noses, and mouths. In the high-level layers, patterns of the entire faces of different users are obtained. With a deep learning model, we can explore the hidden but effective patterns of the original data directly through multiple layers, even without domain knowledge and hand-coded features. This is a critical advantage over traditional shallow learning models.

Remark 1. If a shallow learning paradigm were applied in this case, the model structure would not be sufficient to extract complex hierarchical features. The users of shallow learning models would have to code all these complex hierarchical features manually in the feature extraction process, which is difficult and sometimes impossible.

Remark 2. If other non-neural-network learning models are used, such as spectral clustering, the hidden patterns of the input data cannot be explored directly. For example, spectral clustering treats each data point as a node in a graph and separates the points by cutting the graph. However, it still needs a powerful data representation method to build the graph and cannot construct a high-quality graph by itself. Meanwhile, neural network models, especially deep neural network models, have the ability to represent the hidden patterns of the input data points and accordingly build a high-quality graph. Thus, the non-neural-network models and neural network models are complementary.

Most recently, deep learning has been found to be very effective for multitask learning problems. For example, the following studies have discussed the usage of deep learning for multitask prediction:

(i) Zhang et al. [33] formulated a deep learning model constrained by multiple tasks, so that early stopping can be applied to the different tasks to obtain good learning convergence. Different tasks regarding face images, including facial landmark detection, head pose estimation, and facial attribute detection, are considered together by a common deep convolutional neural network.

(ii) Liu et al. [34] proposed a deep neural network learning method for multitask learning problems, especially for learning representations across multiple tasks. The proposed method can combine cross-task data and also regularizes the neural network to generalize to new tasks. It can be used for both multiple-domain classification problems and information retrieval problems.

(iii) Collobert and Weston [15] proposed a convolutional neural network for the multitask learning problem in natural language processing applications. The targeted tasks include POS tagging, noun chunk prediction, named entity recognition, etc. The proposed network is applied not only to multitask learning but also to semisupervised learning, where only a part of the training set is labeled.

(iv) Seltzer and Droppo [35] proposed to learn a deep neural network for multiple tasks which share the same data representation. This model is applied to acoustic modeling with a primary task and one or more additional tasks, including phone labeling, phone context prediction, and state context prediction.

However, in these methods, the relation among different tasks is not explored explicitly. Although a deep neural model can learn effective high-level abstract features, without explicitly exploring the relation among tasks, different groups of high-level features may be used for different tasks. The deep features are thus separated across tasks, and the relationships among tasks are ignored during the learning of the deep network. To solve this problem, we propose a novel deep learning method that regularizes the parameters of the neural network regarding multiple tasks to be low-rank.

1.2. Our Contributions

The proposed deep neural network is composed of four convolutional layers, three max-pooling layers, and two parallel fully connected layers. The convolutional layers are used to extract useful patterns from local regions of the input data, and the max-pooling layers are used to reduce the size of the intermediate outputs of the convolutional layers while keeping the significant responses. The two parallel fully connected layers are used to map the outputs of the convolutional and max-pooling layers to the labels of the multiple tasks.

The rows of the transformation matrices of the fully connected layers correspond to the mappings of the different tasks. We assume that the tasks under consideration are closely related; thus, the rows of the transformation matrices are not completely independent of each other, and we seek a transformation matrix with a minimum number of independent rows. We use the rank of the transformation matrix to measure the number of independent rows and approximate it by the nuclear norm. During the learning process, we propose to minimize the nuclear norm of one fully connected layer's transformation matrix. Meanwhile, we also assume that, for a group of related tasks, not all the high-level features generated by the convolutional and max-pooling layers are useful. Thus, it is necessary to select the useful features. To this end, we propose to seek sparse rows for the second fully connected layer. The sparsity of the second transformation matrix is measured by its $\ell_1$ norm, which we also minimize in the learning process. Naturally, we hope the predictions of the two fully connected layers are low-rank and sparse simultaneously and also consistent with each other; thus, we propose to minimize the squared $\ell_2$ norm distance between the prediction vectors of the two fully connected layers. Meanwhile, we also reduce the prediction error and the complexity of the filters of the convolutional layers, measured by their squared norms. The objective function is a linear combination of these terms.

We develop an iterative algorithm to minimize the objective function. In each iteration, the transformation matrices and the filters are updated alternately. The transformation matrices are optimized by the gradient descent algorithm, and the filters are optimized by the back-propagation algorithm.

1.3. Paper Organization

The rest of this paper is organized as follows. In Section 2, we introduce the proposed method by modeling the problem as a minimization problem and developing an iterative algorithm to solve it. In Section 3, we evaluate the proposed method experimentally, and in Section 4, we conclude the paper.

2. Proposed Method

2.1. Problem Modeling

Suppose we have a set of data points for the training process, denoted as $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is the $i$-th data point. $x_i$ could be an image (presented as a matrix of pixels) or a text (a sequence of embedding vectors of words). The problem of multitask learning is to predict the label vectors of $m$ tasks. For $x_i$, the label vector is denoted as $y_i = [y_{i1}, \ldots, y_{im}]^\top \in \{+1, -1\}^m$, where $y_{ij} = +1$ if $x_i$ is a positive sample for the $j$-th task, and $y_{ij} = -1$ otherwise.

To this end, we build a deep convolutional network to map each input data point to an output label vector. The network is composed of 4 convolutional layers, 3 max-pooling layers, and 2 parallel fully connected layers. The structure of the deep network is given in Figure 1. Please note that, for different types of input data, the convolutional and max-pooling layers are adjusted accordingly: for matrix inputs such as images, the layers perform 2D convolution and 2D max-pooling, while for sequences such as text, the layers conduct 1D convolution and 1D max-pooling.

We denote the intermediate output vector of the first 7 layers as $g(x) \in \mathbb{R}^{d}$, where $x$ is the input and $d$ is the number of pools of the last max-pooling layer. The set of filters in the convolutional layers of $g(\cdot)$ is denoted as $\Theta$. The outputs of the two parallel fully connected layers are denoted as

$$f_1(x) = W_1^\top g(x), \qquad f_2(x) = W_2^\top g(x),$$

where $W_1 \in \mathbb{R}^{d \times m}$ and $W_2 \in \mathbb{R}^{d \times m}$ are the transformation matrices of the two layers. The two fully connected layers map the $d$-dimensional vector $g(x)$ to two vectors of $m$ scores for the $m$ tasks. Each score measures the degree to which the given data point belongs to the positive class of the corresponding task. The two fully connected layers correspond to the low-rank and sparse prediction results of the network. By fusing their results, we can explore both the low-rank structure of the prediction scores of the multiple tasks and the sparse structure of the deep features learned by the network. In our model, the first fully connected layer is responsible for the low-rank structure, while the second fully connected layer is responsible for the sparse structure.

The final output of the network is the summation of the outputs of the two fully connected layers:

$$f(x) = f_1(x) + f_2(x) = W_1^\top g(x) + W_2^\top g(x).$$
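For illustration, the following is a minimal PyTorch sketch of such a network with two parallel fully connected branches whose outputs are summed. The channel counts, kernel sizes, the 64x64 input resolution, and the exact interleaving of convolution and pooling layers are assumptions made only for this sketch; they are not specified by the model description above.

```python
import torch
import torch.nn as nn

class LowRankMultiTaskCNN(nn.Module):
    """Sketch: four convolutional layers, three max-pooling layers, and two
    parallel fully connected layers (low-rank branch and sparse branch)."""

    def __init__(self, in_channels=3, num_tasks=40):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        d = 128 * 8 * 8  # dimension of g(x) for an assumed 64x64 input
        # Two parallel fully connected layers: W1 (low-rank) and W2 (sparse).
        self.fc_lowrank = nn.Linear(d, num_tasks, bias=False)
        self.fc_sparse = nn.Linear(d, num_tasks, bias=False)

    def forward(self, x):
        g = self.features(x).flatten(1)   # deep representation g(x)
        f1 = self.fc_lowrank(g)           # low-rank branch scores
        f2 = self.fc_sparse(g)            # sparse branch scores
        return f1 + f2, f1, f2            # final prediction is the sum
```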

To learn the parameters of the deep network, $W_1$, $W_2$, and $\Theta$, we consider the following problems:

(i) Low-Rank Regularization. As we discussed earlier, the tasks are not completely independent from each other but are closely related. To explore the relationships between different tasks, we learn a deep and shared representation $g(x)$ for the input data $x$. Based on this shared representation, we also request the transformation matrix $W_1$ of one of the last fully connected layers to be of low rank. The motivation is that the $m$ columns of $W_1$ actually map the representation to the $m$ scores of the $m$ tasks. The rank of $W_1$ measures the maximum number of linearly independent columns of $W_1$. Thus, by minimizing the rank of $W_1$, we impose the mapping functions of the different tasks to be dependent on each other and minimize the number of independent tasks. To measure the rank of the matrix $W_1$, we use the nuclear norm of $W_1$, denoted as $\left\|W_1\right\|_*$ and calculated as the summation of its singular values,

$$\left\|W_1\right\|_* = \sum_{l}\sigma_l(W_1),$$

where $\sigma_l(W_1)$ is its $l$-th singular value. We propose to learn $W_1$ by regularizing its rank as follows:

$$\min_{W_1}\left\|W_1\right\|_*.$$

(ii) Sparse Regularization. We further regularize the transformation matrix $W_2$ of the second fully connected layer by sparsity. The motivation of the sparsity is that the effective deep features for different tasks might be different, and for each task, not all the features are needed. Although we learn a group of deep features in $g(x)$ and share them among all the tasks, for a specific task and its relevant tasks, only a small number of deep features are necessary, and feature selection is a critical step. For the purpose of feature selection, we impose a sparsity penalty on the transformation matrix $W_2$ of the second fully connected layer, since it maps the deep features to the prediction scores of the $m$ tasks. To measure the sparsity of $W_2$, we use the $\ell_1$ norm of $W_2$, which is the summation of the absolute values of all the elements of the matrix:

$$\left\|W_2\right\|_1 = \sum_{j=1}^{d}\sum_{k=1}^{m}\left|w_{2,jk}\right|.$$

We minimize the $\ell_1$ norm of $W_2$ to learn a sparse $W_2$:

$$\min_{W_2}\left\|W_2\right\|_1.$$

(iii) Prediction Consistency. The outputs of the two fully connected layers of low-rank and sparsity may give different results. However, they should be consistent with each other so that the prediction results can be low-rank and sparse simultaneously. To this end, we minimize the squared $\ell_2$ norm distance between the prediction results of the two layers over all the training data points:

$$\min_{W_1, W_2, \Theta}\sum_{i=1}^{n}\left\|W_1^\top g(x_i) - W_2^\top g(x_i)\right\|_2^2.$$

(iv) Prediction Error Minimization. We also propose to learn an effective multitask predictor by minimizing the prediction error. To measure the prediction error of a data point $x_i$, we calculate the squared $\ell_2$ norm distance between its prediction result $f(x_i)$ and its true label vector $y_i$:

$$\left\|f(x_i) - y_i\right\|_2^2 = \left\|W_1^\top g(x_i) + W_2^\top g(x_i) - y_i\right\|_2^2.$$

We learn the parameters of the deep network by minimizing the errors over all the training data points:

$$\min_{W_1, W_2, \Theta}\sum_{i=1}^{n}\left\|W_1^\top g(x_i) + W_2^\top g(x_i) - y_i\right\|_2^2.$$

(v) Complexity Reduction. Finally, we regularize the filters of the convolutional layers, $\Theta$, by their squared norms to prevent the network from being overly complex:

$$\min_{\Theta}\sum_{\theta\in\Theta}\left\|\theta\right\|_2^2.$$

The overall optimization problem is the weighted combination of the problems above:

$$\min_{W_1, W_2, \Theta} O(W_1, W_2, \Theta) = \lambda_1\sum_{i=1}^{n}\left\|W_1^\top g(x_i) + W_2^\top g(x_i) - y_i\right\|_2^2 + \lambda_2\left\|W_1\right\|_* + \lambda_3\left\|W_2\right\|_1 + \lambda_4\sum_{i=1}^{n}\left\|W_1^\top g(x_i) - W_2^\top g(x_i)\right\|_2^2 + \sum_{\theta\in\Theta}\left\|\theta\right\|_2^2,$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are the weights of the different objective terms, and $O$ is the overall objective function. By optimizing this problem, we obtain a deep convolutional network with low-rank and sparse deep features for the multitask learning problem.
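As a rough illustration, a minimal PyTorch sketch of this combined objective is given below. The weight names lam1 to lam4 mirror $\lambda_1$ to $\lambda_4$ above, the unit weight on the complexity term and the default values are assumptions, and f1 and f2 denote the minibatch outputs of the low-rank and sparse branches (e.g., from the network sketch in the previous subsection).

```python
import torch

def multitask_objective(f1, f2, y, W1, W2, conv_params,
                        lam1=1.0, lam2=0.1, lam3=0.1, lam4=0.1):
    """Sketch of the combined objective: prediction error, nuclear norm of W1,
    l1 norm of W2, consistency between the two branches, and squared norms of
    the convolutional filters. Weights are illustrative placeholders."""
    error = ((f1 + f2 - y) ** 2).sum()                     # prediction error term
    low_rank = torch.linalg.matrix_norm(W1, ord='nuc')     # nuclear norm ||W1||_*
    sparsity = W2.abs().sum()                              # l1 norm ||W2||_1
    consistency = ((f1 - f2) ** 2).sum()                   # agreement of the two branches
    complexity = sum((p ** 2).sum() for p in conv_params)  # squared norms of the filters
    return (lam1 * error + lam2 * low_rank + lam3 * sparsity
            + lam4 * consistency + complexity)
```

In practice, W1 and W2 would be the weight matrices of the two parallel fully connected layers (fc_lowrank.weight and fc_sparse.weight in the earlier sketch), and conv_params would iterate over the convolutional filters.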

2.2. Optimization

To solve the overall optimization problem, we use an alternating optimization method. The parameters are updated in an iterative algorithm: when one parameter is updated, the others are fixed. In the following subsections, we discuss how to solve each subproblem separately.

2.2.1. Updating $W_1$

When we update $W_1$, we fix $W_2$ and $\Theta$, remove the terms irrelevant to $W_1$ from the overall objective, and obtain the following optimization problem:

$$\min_{W_1} O_1(W_1) = \lambda_1\sum_{i=1}^{n}\left\|W_1^\top g(x_i) + W_2^\top g(x_i) - y_i\right\|_2^2 + \lambda_4\sum_{i=1}^{n}\left\|W_1^\top g(x_i) - W_2^\top g(x_i)\right\|_2^2 + \lambda_2\left\|W_1\right\|_*,$$

where $O_1$ is the objective function of this subproblem. To solve it, we use the gradient descent algorithm: $W_1$ is descended in the direction of the gradient of $O_1$,

$$W_1 \leftarrow W_1 - \varsigma\,\nabla O_1(W_1),$$

where $\nabla O_1$ is the gradient function of $O_1$, and $\varsigma$ is the descent step size. To calculate the gradient function $\nabla O_1$, we first split the objective into two terms, $O_1(W_1) = O_{1a}(W_1) + O_{1b}(W_1)$, where

$$O_{1a}(W_1) = \lambda_1\sum_{i=1}^{n}\left\|W_1^\top g(x_i) + W_2^\top g(x_i) - y_i\right\|_2^2 + \lambda_4\sum_{i=1}^{n}\left\|W_1^\top g(x_i) - W_2^\top g(x_i)\right\|_2^2, \qquad O_{1b}(W_1) = \lambda_2\left\|W_1\right\|_*.$$

The first term, $O_{1a}$, is a quadratic term, while $O_{1b}$ is a nuclear norm term. Thus, the gradient function of $O_1$ is the sum of the gradient functions of the two terms, $\nabla O_1(W_1) = \nabla O_{1a}(W_1) + \nabla O_{1b}(W_1)$, where $\nabla O_{1a}$ can be easily obtained as

$$\nabla O_{1a}(W_1) = 2\sum_{i=1}^{n} g(x_i)\left[\lambda_1\left(W_1^\top g(x_i) + W_2^\top g(x_i) - y_i\right) + \lambda_4\left(W_1^\top g(x_i) - W_2^\top g(x_i)\right)\right]^\top.$$

To obtain the gradient function of $O_{1b}$, we first decompose $W_1$ by singular value decomposition (SVD), $W_1 = U\Sigma V^\top$, where $U$ and $V$ are orthogonal matrices, and $\Sigma$ is a diagonal matrix containing the singular values. According to Proposition 1 of [36], the gradient of $\left\|W_1\right\|_*$ is $UV^\top$; thus,

$$\nabla O_{1b}(W_1) = \lambda_2\, U V^\top.$$
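For concreteness, a small NumPy sketch of this SVD-based (sub)gradient is given below; the function name and the commented descent step are illustrative and not part of the original notation.

```python
import numpy as np

def nuclear_norm_subgradient(W):
    """(Sub)gradient of the nuclear norm ||W||_* at W: if W = U diag(s) V^T,
    then U V^T is a valid (sub)gradient (the gradient when W has full rank)."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

# Illustrative descent step on W1 (step size and weight are placeholders):
# W1 -= step * (grad_quadratic_part(W1) + lam2 * nuclear_norm_subgradient(W1))
```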

2.2.2. Updating $W_2$

To update $W_2$, we also fix the other parameters and remove the irrelevant terms:

$$\min_{W_2} O_2(W_2) = O_{2a}(W_2) + O_{2b}(W_2),$$

where

$$O_{2a}(W_2) = \lambda_1\sum_{i=1}^{n}\left\|W_1^\top g(x_i) + W_2^\top g(x_i) - y_i\right\|_2^2 + \lambda_4\sum_{i=1}^{n}\left\|W_1^\top g(x_i) - W_2^\top g(x_i)\right\|_2^2$$

is a quadratic term, and

$$O_{2b}(W_2) = \lambda_3\left\|W_2\right\|_1$$

is an $\ell_1$ norm term. We also use the gradient descent algorithm to update $W_2$:

$$W_2 \leftarrow W_2 - \varsigma\,\nabla O_2(W_2),$$

where

$$\nabla O_2(W_2) = \nabla O_{2a}(W_2) + \nabla O_{2b}(W_2).$$

The gradient $\nabla O_{2a}$ is obtained analogously to $\nabla O_{1a}$, with the sign of the consistency term reversed.

To obtain the gradient function of $O_{2b}$, we rewrite $\left\|W_2\right\|_1$ as a sum over rows,

$$\left\|W_2\right\|_1 = \sum_{i=1}^{d}\left\|w_2^i\right\|_1, \qquad \left\|w_2^i\right\|_1 = \sum_{j=1}^{m}\left|w_{2,ij}\right|,$$

where $w_2^i$ is the $i$-th row of $W_2$. For the gradient function of $O_{2b}$ with regard to $W_2$, we decompose the problem into the gradients with regard to the different rows of $W_2$, since in this problem the rows are independent of each other:

$$\nabla O_{2b}(W_2) = \lambda_3\left[\nabla\left\|w_2^1\right\|_1; \ldots; \nabla\left\|w_2^d\right\|_1\right],$$

where $\nabla\left\|w_2^i\right\|_1$ is the (sub)gradient with regard to $w_2^i$, and according to the expressions above, we have the subgradient as follows:

$$\nabla\left\|w_2^i\right\|_1 = \mathrm{sign}\left(w_2^i\right).$$
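A corresponding NumPy sketch of this subgradient is shown below; applying the sign function entrywise to the full matrix gives the same result as stacking the row-wise subgradients.

```python
import numpy as np

def l1_subgradient(W):
    """Entrywise subgradient of ||W||_1 = sum_ij |W_ij|: sign(W_ij) where
    W_ij != 0, and 0 (a valid choice from [-1, 1]) where W_ij == 0."""
    return np.sign(W)

# Illustrative descent step on W2 (step size and weight are placeholders):
# W2 -= step * (grad_quadratic_part(W2) + lam3 * l1_subgradient(W2))
```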

2.2.3. Updating the Filters $\Theta$

To optimize the filters of the deep network, we fix both $W_1$ and $W_2$ and use the back-propagation algorithm based on the chain rule. The corresponding problem is given as follows:

$$\min_{\Theta} O_3(\Theta) = \sum_{i=1}^{n}\phi(x_i) + \sum_{\theta\in\Theta}\left\|\theta\right\|_2^2,$$

where

$$\phi(x_i) = \lambda_1\left\|W_1^\top g(x_i) + W_2^\top g(x_i) - y_i\right\|_2^2 + \lambda_4\left\|W_1^\top g(x_i) - W_2^\top g(x_i)\right\|_2^2$$

is a data-pointwise term. Back-propagation is based on the gradient descent algorithm: each filter $\theta\in\Theta$ is updated as

$$\theta \leftarrow \theta - \varsigma\,\nabla_\theta O_3,$$

and according to the chain rule,

$$\nabla_\theta O_3 = \sum_{i=1}^{n}\left(\frac{\partial g(x_i)}{\partial\theta}\right)^\top \frac{\partial\phi(x_i)}{\partial g(x_i)} + 2\theta,$$

where

$$\frac{\partial\phi(x_i)}{\partial g(x_i)} = 2\lambda_1\left(W_1 + W_2\right)\left(W_1^\top g(x_i) + W_2^\top g(x_i) - y_i\right) + 2\lambda_4\left(W_1 - W_2\right)\left(W_1^\top g(x_i) - W_2^\top g(x_i)\right),$$

and $\partial g(x_i)/\partial\theta$ is propagated backward through the max-pooling and convolutional layers.
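In a modern framework, this chain-rule computation can be delegated to automatic differentiation. The sketch below assumes the LowRankMultiTaskCNN sketch from Section 2.1 and updates only the convolutional filters while keeping $W_1$ and $W_2$ fixed; the loss weights and learning rate are placeholders.

```python
import torch

def filter_update_step(model, x, y, lam1=1.0, lam4=0.1, lr=1e-3):
    """One back-propagation step on the convolutional filters only.
    `model` follows the LowRankMultiTaskCNN sketch and returns (f1 + f2, f1, f2)."""
    pred, f1, f2 = model(x)
    # Terms of the objective that depend on g(x): error, consistency, filter norms.
    loss = (lam1 * ((pred - y) ** 2).sum()
            + lam4 * ((f1 - f2) ** 2).sum()
            + sum((p ** 2).sum() for p in model.features.parameters()))
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.features.parameters():  # update the filters only
            p -= lr * p.grad
    return float(loss)
```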

3. Experiments

In this section, we test the proposed method over several multitask learning problems and compare it to the state-of-the-art deep learning methods for the multitask learning problem.

3.1. Experiment Setting

We test the proposed method over the following benchmark datasets:

(i) Large-Scale CelebFaces Attributes (CelebA) Dataset. The first dataset we used is a face image dataset, named the CelebA dataset [37]. This dataset has 202,599 images, and each image is annotated with 40 binary attributes, such as wearing eyeglasses, wearing a hat, having a pointy nose, smiling, etc. The prediction of each attribute is treated as a task; thus, this is a 40-task multitask learning problem. The input data are the image pixels. The dataset can be downloaded at http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html.

(ii) Annotated Corpus for Named Entity Recognition. The second dataset we used is a dataset for named entity recognition. It contains 47,959 sentences with 1,048,576 words in total. Each word is tagged with a named entity type, such as Geographical Entity, Organization, Person, etc., or marked as a nonnamed entity. Moreover, each word is also tagged with a part-of-speech (POS) type, such as noun, pronoun, adjective, determiner, verb, adverb, etc. Meanwhile, we also have the noun chunk labels. We therefore have three tasks for each word: named entity recognition (NER), POS labeling, and noun chunking. For each word, we use a window of size 7 to extract its context, and the embedding vectors of the words in the window are used as the input (a small sketch of this construction is given after this list). This dataset can be downloaded from https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus.

(iii) Economics. The third dataset we used is a dataset for the tasks of property price trend and stock price trend prediction. The input data are historical series of property prices and stock prices; each data point covers three months of both property and stock prices, and its labels are the trends of the stock price and the property price. We collected the data of the last 20 years for the USA and China and generated a total of 480 data points.
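As referenced in item (ii) above, the sketch below illustrates one way to build the 7-word context windows; the `embed` lookup and the zero-padding at sentence boundaries are assumptions made for illustration only.

```python
import numpy as np

def context_windows(sentence, embed, dim, size=7):
    """For each word in `sentence`, stack the embeddings of the `size`-word
    window centered on it (zero vectors pad the sentence boundaries)."""
    half = size // 2
    pad = [np.zeros(dim)] * half
    vecs = pad + [embed(w) for w in sentence] + pad
    return [np.stack(vecs[i:i + size]) for i in range(len(sentence))]

# Example (with a hypothetical embedding lookup `embed` returning dim-vectors):
# windows = context_windows(["John", "lives", "in", "London"], embed, dim=50)
# len(windows) == 4, and each window has shape (7, 50)
```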

In the experiments, we split each dataset into a training set and a test set of equal sizes. The training set is used to learn the parameters of the deep network, and the test set is then used to evaluate the performance of the proposed learning method. To measure the performance, we use the average accuracy over the different tasks.

3.2. Experiment Results
3.2.1. Comparison of Prediction Accuracy of Different Methods

We compare the proposed method against several deep learning-based multitask methods, including the methods proposed by Zhang et al. [33], Liu et al. [34], Collobert and Weston [15], and Seltzer and Droppo [35]. The results are reported in Figure 2. According to the results, the proposed method always achieves the best prediction performance over the three multitask learning problems, especially on the NER and Economics datasets. For the Economics benchmark dataset, our method is the only method that obtains an average prediction accuracy higher than 0.80, while the other methods only obtain accuracies lower than 0.75. This is not surprising, since our method has the ability to explore the relations among different tasks through the low-rank regularization of the task-specific weights of the CNN model. In the Economics benchmark dataset, the number of training examples is small; thus, it is even more important to share the data representation across different tasks. For the CelebFaces dataset, the improvement of the proposed method over the other methods is slight. Moreover, we also observe that the methods of Zhang et al. [33] and Liu et al. [34] outperform the methods of Collobert and Weston [15] and Seltzer and Droppo [35] in most cases.

3.2.2. Comparison of Running Time of Different Methods

We also report the running time of the training processes of the compared methods in Figure 3. According to the results reported in the figure, the training process of Seltzer and Droppo's [35] method is the longest, and the most efficient method is Collobert and Weston's [15] algorithm. The training time of our method is longer than that of Zhang et al.'s [33] and Collobert and Weston's [15] methods, but it is still acceptable for the CelebFaces and NER datasets. For the Economics benchmark dataset, the training time is much shorter than for the other two datasets, since its size is relatively small.

3.2.3. Influence of Tradeoff Parameters

In our method, there are four important tradeoff parameters, which control the weights of the classification error term, the low-rank term of the weight matrix, the $\ell_1$ sparsity term of the weight matrix, and the consistency term between the predictions of the sparse model and the low-rank model. The four tradeoff parameters are $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$. We study the influence of their values on the prediction accuracy and report the results of our method with varying values of these parameters in Figure 4. We have the following observations:

(i) According to the results in Figure 4, when the value of $\lambda_1$ increases from 0.01 to 100, the prediction accuracy keeps growing. This is due to the fact that this parameter is the weight of the classification error term: as its value increases, the classification error over the training set plays a more and more important role in the learning process, which boosts the classification performance accordingly. When its value is larger than 100, however, the improvement is no longer significant.

(ii) When the value of $\lambda_2$ increases, the performance of the proposed method keeps improving. This reflects the importance of the low-rank regularization in the proposed method. $\lambda_2$ controls the weight of the low-rank regularization term, which is the key to exploring the relationships among the different tasks of the multitask problem. This is even more obvious for the Economics dataset, where the data size is small and cross-task information plays a more important role.

(iii) The proposed algorithm is stable to changes in the value of $\lambda_3$, which is the weight of the sparsity term of the objective. This term plays the role of feature selection over the convolutional representation of the input data. The stability over changes of $\lambda_3$ implies that the convolutional features extracted by our model already give good performance; thus, feature selection does not significantly improve it further.

(iv) For the parameter $\lambda_4$, the average accuracy improves slightly as its value increases up to 100; beyond that, the performance decreases slightly. This suggests that the consistency between the sparse and low-rank predictions somewhat improves the performance, but it does not always help: forcing consistency with a too large weight on the consistency term does not improve the performance further.

4. Conclusion

In this paper, we proposed a novel deep learning method for the multitask learning problem. The proposed deep network has convolutional, max-pooling, and fully connected layers. The parameters of the network are regularized to be low-rank to explore the relationships among different tasks. Meanwhile, the network also performs deep feature selection through sparsity regularization. The learning of the parameters is modeled as a joint minimization problem and solved by an iterative algorithm. The experiments over the benchmark datasets show its advantage over the state-of-the-art deep learning-based multitask models.

Data Availability

The Large-scale CelebFaces Attributes (CelebA) dataset used to support the findings of this study has been deposited in the CelebA repository at http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html. The Annotated Corpus for Named Entity Recognition data used to support the findings of this study have been deposited in the entity-annotated-corpus repository at https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was funded by the MOE Project of Humanities and Social Sciences of China under Grant No. 19YJAZH076, and the “Thousand People Plan” Specially Invited Expert for Young Professionals in Shaanxi Province, Shaanxi Soft Science Research Program under Grant No. 2018KRM065.