Abstract

Legal judgment prediction (LJP), as an effective and critical application in legal assistant systems, aims to determine the judgment results according to the information based on fact determination. In real-world scenarios, when dealing with criminal cases, judges not only take advantage of the fact description but also consider external information, such as the basic information of the defendant and the court view. However, most existing works take the fact description as the sole input for LJP and ignore the external information. We propose a Transformer-Hierarchical-Attention-Multi-Extra (THME) Network to make full use of the information based on fact determination. We conduct experiments on a real-world large-scale dataset of criminal cases in the civil law system. Experimental results show that our method outperforms state-of-the-art LJP methods on all judgment prediction tasks.

1. Introduction

Legal judgment prediction (LJP) aims to predict the judgment results according to the information based on fact determination, which consists of the fact description, the basic information of the defendant, and the court view. LJP techniques can provide inexpensive and useful legal judgment results to people who are unfamiliar with legal terminology, and they are also helpful for legal consulting. Moreover, they can serve as a handy reference for professionals (e.g., lawyers and judges) and improve their work efficiency.

LJP is regarded as a classic text classification problem and has been researched for many years [1]. For example, Liu et al. proposed extracting shallow textual features (e.g., Chinese characters, words, and phrases) for charge prediction [2]. Katz et al. predicted the US Supreme Court's decisions based on efficient features from case profiles [3]. Luo et al. combined the fact description with the corresponding law articles to predict the charges [4]. Although great progress has been made in LJP, some problems still remain, such as the multiple subtasks, the topological dependencies between subtasks, and cases with similar descriptions but different penalties. Zhong et al. pointed out that law article prediction is one of the fundamental subtasks in countries with a civil law system (e.g., China, France, and Germany) and that these subtasks have a strict order in the real world [5]. Further, Yang et al. proposed a neural model for the interaction between subtask results [6].

Despite these efforts in designing efficient features and employing advanced Natural Language Processing (NLP) techniques, LJP still confronts two major challenges.

1.1. The Lack of External Information

Some existing works propose various mechanisms to extract information from the fact description, such as the Word Collection Attention mechanism. Other works propose various frameworks to build the dependencies between subtasks, such as the DAG dependencies of subtasks and MPBFN. However, for the judgment document in Figure 1, there are many other information items besides the fact description that can be utilized. We call such information the external information, which includes the basic information of the defendant and the court view. Therefore, how to utilize the external information effectively is a major challenge.

1.2. Encoding Long Document Is Difficult

The fact description in a judgment document is often a long document, which brings the long-term dependency problem. Many existing models that perform well in text processing, such as the Recurrent Neural Network (RNN) [7] and the Convolutional Neural Network (CNN) [8], are unable to deal with the long-term dependency problem. Moreover, a judgment document contains some keywords that are very important for LJP, and finding them in such a long document is very difficult.

In order to resolve the above challenges, in this paper, we propose the Transformer-HAN-Multi-Extra (THME) Network. It contains a structured data encoder to extract the semantics of the external information, as well as a Transformer-Hierarchical Attention Network (Transformer-HAN) encoder to encode the fact description. Specifically, as shown in Figure 1, from the basic information of the defendant, we can obtain the defendant's gender, age, and education level, as well as the content related to the defendant's criminal record, by using regular expressions. Similarly, we can obtain some objective attributes of a case, such as the amount of money involved, the plot, and the consequences, from the court view. Based on the statistical analysis of large samples, we can find the relationship between these data and the terms of penalty, as shown in Table 1, where each attribute is marked as related, positively correlated, or negatively correlated with the terms of penalty. For example, given the same conditions, a male defendant's terms of penalty is longer than a female's for certain cases (related). The more serious the case's plot is, the longer the defendant's terms of penalty will be (positive correlation). The better the defendant's guilty attitude is, the shorter the defendant's terms of penalty will be (negative correlation).

It is worth noting that the case's conclusion in the judgment document is significant for the terms of penalty, but it cannot be used as an input to predict the terms of penalty: the conclusion already encodes the judgment, so using it as an input would be self-deceiving. Therefore, we first use the external information to predict the case's conclusion and then use the predicted conclusion together with the external information to predict the terms of penalty. Meanwhile, according to the data attributes, we divide the data into continuous and discrete types and extract the required information via a continuous data encoder and a discrete data encoder, respectively.

In order to reduce the information loss when converting sentences into fixed-length vectors, an attention mechanism is adopted; however, attention alone cannot solve the polysemy problem. We therefore adopt the Transformer [9]: built on attention, it has advantages over the RNN in solving the long-term dependency problem and performs better than a plain attention mechanism on polysemy. The Hierarchical Attention Network (HAN) can easily catch the keywords in a long document [10]. Thus, we combine the Transformer with the HAN to solve the long-term dependency problem. Experimental results show that the performance of Transformer-HAN is better than that of Gated Recurrent Unit (GRU)-HAN.

The main contributions of this paper are summarized as follows:

(i) We propose a novel text processing structure, namely, Transformer-HAN, to improve the text encoding ability. This model solves the long-term dependency problem better than GRU-HAN. Apart from the necessary fully connected parameter matrices, the Transformer-HAN encoder uses only the attention mechanism, and it runs much faster than encoder structures based on the GRU and Long Short-Term Memory (LSTM).

(ii) We propose a structured data encoder. To introduce the external information as an auxiliary, we extract fact-related data from the defendant's basic information and the court view as supplementary inputs to the model. According to the different attributes of the data, we design both a continuous and a discrete data encoder. Experiments show that the information based on fact determination can effectively improve judgment prediction, especially the prediction of the terms of penalty.

(iii) Experimental results show that the THME Network can effectively improve the prediction accuracy on few-shot data. The macro-average indicators of the three tasks (law article prediction, charge prediction, and terms of penalty prediction) improve relative to other models, which indicates that the prediction accuracy on few-shot data is greatly improved.

The rest of this paper is organized as follows. Section 2 briefly reviews the related work. In Section 3, we propose the overall THME framework and detailed methods. The experimental results and analyses are presented in Section 4. Finally, Section 5 contains the concluding remarks.

2. Related Work

2.1. Legal Judgment Prediction

With the development of the Chinese legal digitalization process, LJP, as one of the most critical steps in LegalAI, has become more and more important. Thanks to the development of machine learning and text mining techniques, more researchers formalize this task under text classification frameworks. Most of these studies attempt to extract textual features [11–13] or introduce some external knowledge [4, 14]. However, these methods can only utilize shallow features and manually designed factors, and their effect usually degrades when applied to other scenarios. Therefore, researchers leverage other technologies to improve the interpretability and generalization of the models. For example, Jiang et al. utilized deep reinforcement learning to derive short snippets of documents from the fact descriptions to predict charges [15], and Chen et al. proposed a Legal Graph Network (LGN) to achieve high-precision classification of crimes [16]. Due to the rareness of some types of cases in real life, the few-shot problem is inevitable. While traditional machine learning methods can hardly solve this problem, researchers find that neural networks achieve good results. For example, Chen et al. proposed a neural network model that embeds law articles and fact descriptions into the same embedding space [17]. Yang et al. proposed a repeated interaction mechanism to simulate the process of a judge's decision [18].

2.2. Multitask Learning

Multitask models have many beneficial effects on deep learning tasks. Sulea et al. proposed multiple tasks, including law article prediction, charge prediction, and terms of penalty prediction, to test the application of machine learning in the judicial field [19]. Zhong et al. proposed a topological structure network that simulates the judge's judgment process to improve the performance of the various tasks [5]. Yang et al. designed a Multi-Perspective Bi-Feedback Network (MPBFN) to enhance the connection between tasks and allow task results to flow in both directions [6]. Wang et al. organized the relationships between law articles as a tree structure via a Hierarchical Matching Network (HMN) and matched relevant law articles via a two-layer matching network, which can improve work efficiency [20].

The emergence of multitask learning has promoted the development of LJP; however, due to the lack of external information, the prediction of terms of penalty remains unsatisfactory. In this work, we propose a framework to utilize the external information effectively. Different from most existing works, we extract information from both the fact description and the external information and merge them together into a topological classifier to predict the three subtasks of LJP.

3. Method

In this section, we will describe the THME Network. We first give the essential definitions of the LJP task and the composition of THME Network in Sections 3.1 and 3.2, respectively. We describe a text encoder for fact descriptions in Section 3.3. We introduce the structured data encoder in Section 3.4. Finally, the classifier is proposed in Section 3.5.

3.1. Problem Formulation

In most Chinese text processing tasks, char-granularity processing is superior to word-granularity processing [21], so for each judgment document, we treat each Chinese character as a token; this reduces the complexity of the model and makes it easier to fit. The fact description is a token sequence $X = \{x_1, x_2, \ldots, x_n\}$, where $n$ is the number of tokens. Besides the input $X$, the basic information of the defendant and the court view are also taken as external inputs of the structured data encoder. Given these inputs, we predict the judgment results of applicable law articles, charges, and terms of penalty, which is a multitask classification problem.

3.2. Overview

Our THME consists of three parts, i.e., the text encoder, the structured data encoder, and the classifier. The text encoder is composed of a text embedding layer, a text convolution layer, a main encoder layer, and an information extraction layer. Due to the different attributes of the structured data, we divide it into discrete data and continuous data, for which we propose a discrete data encoder and a continuous data encoder, respectively. The classifier is implemented with a topological structure, which utilizes the topological dependencies between the subtasks in LJP. The general framework of THME is shown in Figure 2.

We employ a text encoder to extract the information from the fact description. The fact description is embedded and fed into a CNN, so that higher-level features are gradually extracted from the shallow textual features; $x_{ij}$ represents the $j$-th Chinese character in the $i$-th sentence. The main encoder layer is the Transformer-HAN, which includes two layers: the first layer aggregates token-level features into sentence-level features, and the second layer aggregates sentence-level features into text-level features. Finally, through the information extraction layer, we generate four hidden-layer states: $h_1$, $h_2$, and $h_3$ corresponding to the three subtasks of LJP, and $h_c$ corresponding to the case's conclusion, which is critical in predicting the terms of penalty. Next, we employ regular expressions to extract the discrete data and the continuous data from the external information. Then, we standardize the continuous data, embed the discrete data, and input them into the continuous data encoder and the discrete data encoder, respectively. The outputs of these two encoders are combined to generate the structured data vector $d$. $d$ and the hidden-layer state $h_c$ are concatenated and fed into a fully connected network to predict the case's conclusion $c$. The case's conclusion vector $c$ and the structured data vector $d$ make up the output of the structured data encoder, $e$. Finally, $e$ and the hidden-layer states of all subtasks in LJP are fed into the classifier with topological structure to predict the law articles, charges, and terms of penalty.

3.3. Text Encoder for Fact Description

We employ a text encoder to generate the vector of the fact description as the input of the classifier. We briefly introduce this encoder, which is composed of the lookup layer, the convolution layer, the Transformer-HAN layer, and the information extraction layer.

3.3.1. Lookup and Convolution

Taking the token sequence $X$ as input, the encoder computes a simple text representation through two layers, i.e., the lookup layer and the convolution layer.

(1) Lookup. We first convert each token in $X$ into a natural number by a preprocessed dictionary mapping, so the token sequence is converted into an integer sequence $I = \{i_1, i_2, \ldots, i_n\}$. Next, we initialize a word embedding matrix $E \in \mathbb{R}^{|V| \times d_w}$, where $|V|$ is the size of the dictionary. $I$ is mapped to embeddings via $E$. Thus, we obtain the text embedding sequence $W = \{w_1, w_2, \ldots, w_n\}$, where $d_w$ is the length of each word embedding.

(2) Convolution. For $W$, we perform a convolution operation with the convolution matrix $C \in \mathbb{R}^{m \times (k \cdot d_w)}$, given by

$$u_i = C \cdot w_{i:i+k-1} + b,$$

where $w_{i:i+k-1}$ is the concatenation of the word embeddings in the $i$-th window, $b$ is the bias vector, $m$ is the number of filters, and $k$ is the size of the sliding window. We apply the convolution over each window and finally obtain $U = \{u_1, u_2, \ldots, u_n\}$. The Chinese character vectors after convolution carry n-gram features; that is to say, each Chinese character vector after convolution has context features and is no longer isolated.
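For concreteness, the following is a minimal sketch of the lookup and convolution layers in TensorFlow (which Section 4.3 mentions as the implementation framework); the vocabulary size and kernel width here are illustrative assumptions, not the paper's exact values.

```python
import tensorflow as tf

VOCAB_SIZE = 8000   # |V|, size of the dictionary (assumed value)
D_EMB = 256         # d_w, length of each word embedding (Section 4.3)

token_ids = tf.keras.Input(shape=(320,), dtype=tf.int32)        # padded token ids
emb = tf.keras.layers.Embedding(VOCAB_SIZE, D_EMB)(token_ids)   # lookup: (batch, 320, 256)
conv = tf.keras.layers.Conv1D(filters=256, kernel_size=3,       # m filters, window k = 3
                              padding='same')(emb)              # n-gram context features
encoder_front = tf.keras.Model(token_ids, conv)                 # W -> U
```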

3.3.2. Transformer-HAN Encoder and Information Extraction

(1) Transformer-HAN encoder. The Transformer is currently the most mainstream information extractor, mainly due to its unique attention mechanism, which achieves true bidirectional encoding. However, the number of parameters of a multilayer Transformer encoder is very large. In order to fully exploit the Transformer while constraining the number of parameters, we design the Transformer-HAN as our main encoder.

The Transformer-HAN encoder is divided into two layers: the first layer uses the Transformer for Chinese character-granularity encoding, then uses the attention mechanism to extract the most important information from the word embeddings, and combines them into sentence vectors. The second layer uses the Transformer for sentence-granularity encoding, then uses the attention mechanism to extract the most important information from the sentence vectors, and combines them into a chapter-granularity vector. Accordingly, the fact description is divided into sentences $\{s_1, s_2, \ldots, s_L\}$, and the $i$-th sentence $s_i$ consists of Chinese characters $\{x_{i1}, x_{i2}, \ldots, x_{iT}\}$, where $T$ is the number of characters in a sentence.

Since the Transformer encoder is less sensitive to the positions of Chinese characters, we need to add a position embedding to each word embedding before input. For the Chinese character at position $pos$ in the $j$-th sentence $s_j$, we calculate its position vector $p_{pos}$ as

$$p_{(pos,\,2l)} = \sin\!\left(\frac{pos}{10000^{2l/d_w}}\right), \qquad p_{(pos,\,2l+1)} = \cos\!\left(\frac{pos}{10000^{2l/d_w}}\right),$$

where $pos$ is the position of this Chinese character in the sentence, $l$ indexes the values in its word embedding, and $d_w$ is the dimension of its word embedding. The position vectors of all Chinese characters in the sentence form the sequence $P = \{p_1, p_2, \ldots, p_T\}$. Then, we merge the position sequence $P$ with the convolved features $U_j$ of sentence $s_j$ to obtain the sentence sequence with position information, given by

$$Z_j = U_j \oplus P,$$

where $\oplus$ is an element-wise addition operation.
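The sinusoidal position vectors follow directly from the two formulas above; a minimal NumPy sketch (function and variable names are ours):

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal position vectors: sin on even dims, cos on odd dims."""
    pos = np.arange(seq_len)[:, None]    # position of the character in the sentence
    i = np.arange(d_model)[None, :]      # index within the word embedding
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

# Element-wise addition merges position information into the features:
# Z_j = U_j + positional_encoding(T, d_model)
```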

The Transformer encoder is composed of Multihead Attention (MHA), the Add & Norm Layer, and the Feed Forward (FF) layer. Multihead Attention is built on Self-Attention, for which the inputs $Q$, $K$, and $V$ are the same. Multihead Attention converts $Q$, $K$, and $V$ into $QW_i^Q$, $KW_i^K$, and $VW_i^V$ through linear transformations with parameter matrices. Next, we apply the Self-Attention mechanism to extract the semantic information. This process is repeated $h$ times; the results are concatenated together, and a final linear transformation is performed. The calculation process is given as follows:

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{(QW_i^Q)(KW_i^K)^\top}{\sqrt{d_k}}\right) VW_i^V,$$

$$\mathrm{MHA}(Q, K, V) = [\mathrm{head}_1;\, \mathrm{head}_2;\, \ldots;\, \mathrm{head}_h]\, W^O,$$

where $[\cdot\,;\cdot]$ is the vector concatenation operation, $d_k$ is the size of each head, and $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$ are the parameter matrices.

Add & Norm Layer contains the Add layer and the Norm layer. First, we merge the input $Z$ of the Multihead Attention with its output and obtain the fact semantic vector $A$ as

$$A = Z + \mathrm{MHA}(Z, Z, Z).$$

There are two reasons for this: First, it can compensate for the loss of information. Second, it is equivalent to introducing a highway in the network: during backpropagation, part of the gradient can propagate directly through the original information without going through the complex network, preventing gradient explosion or gradient vanishing. Then, we employ Layer Normalization [22] to normalize $A$ and obtain $A' = \mathrm{LayerNorm}(A)$. The Feed Forward layer then yields the hidden sequence $H_j = \{h_{j1}, h_{j2}, \ldots, h_{jT}\}$ for sentence $s_j$ as

$$H_j = \max(0,\, A'W_1 + b_1)\, W_2 + b_2,$$

where $W_1$ and $W_2$ are the parameter matrices and $b_1$ and $b_2$ are the bias vectors. Then, we use an attention vector to extract the main information. In order to get the sentence vector $s_j$, we initialize an attention vector $u_s$ and obtain $s_j$ as

$$v_{ji} = \tanh(W_a h_{ji} + b_a), \qquad \alpha_{ji} = \frac{\exp(v_{ji}^\top u_s)}{\sum_{t=1}^{T} \exp(v_{jt}^\top u_s)}, \qquad s_j = \sum_{i=1}^{T} \alpha_{ji} h_{ji}.$$
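Putting the pieces together, the sketch below outlines one Transformer encoder block followed by the attention pooling that produces a sentence vector. This is a hedged re-implementation in tf.keras under assumed dimensions, not the authors' exact code.

```python
import tensorflow as tf

class EncoderLayer(tf.keras.layers.Layer):
    """One Transformer block: MHA -> Add & Norm -> Feed Forward -> Add & Norm."""
    def __init__(self, d_model=256, num_heads=4, d_ff=512):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(num_heads, d_model // num_heads)
        self.ff = tf.keras.Sequential([
            tf.keras.layers.Dense(d_ff, activation='relu'),
            tf.keras.layers.Dense(d_model)])
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()

    def call(self, z):                               # z: (batch, T, d_model)
        a = self.norm1(z + self.mha(z, z))           # residual "highway" + LayerNorm
        return self.norm2(a + self.ff(a))            # position-wise feed forward

class AttentionPool(tf.keras.layers.Layer):
    """HAN-style pooling: a learned attention vector scores each position."""
    def __init__(self, d_model=256):
        super().__init__()
        self.proj = tf.keras.layers.Dense(d_model, activation='tanh')
        self.u = self.add_weight(name='u', shape=(d_model,),
                                 initializer='glorot_uniform')

    def call(self, h):                               # h: (batch, T, d_model)
        scores = tf.einsum('btd,d->bt', self.proj(h), self.u)
        alpha = tf.nn.softmax(scores, axis=-1)       # attention weights over tokens
        return tf.einsum('bt,btd->bd', alpha, h)     # weighted sum -> sentence vector
```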

Similarly, we get the sentence sequence $S = \{s_1, s_2, \ldots, s_L\}$. The sentence encoder is basically the same as the Chinese character encoder; the difference is that the token vectors are replaced with the sentence vectors produced by the Chinese character encoder.

Since we still use the Transformer to encode the sentence sequence, we first calculate the sentences' position vectors $P^s$ and merge them with the sentence sequence $S$ by

$$Z^s = S \oplus P^s.$$

As the input of the Transformer, $Z^s$ passes through the Transformer's MHA, Add & Norm Layer, and Feed Forward layer to obtain a new sentence sequence $H^s = \{h_1^s, h_2^s, \ldots, h_L^s\}$, which has higher-level characteristics and more comprehensive and useful information.

(2) Information extraction. Finally, for the three subtasks of LJP and the case's conclusion, we need four different attention vectors to extract four different kinds of information from the same information sequence. We first initialize four attention vectors $u_t$, $t \in \{1, 2, 3, c\}$, and obtain the vector $h_t$ as

$$\alpha_{ti} = \frac{\exp\!\left(\tanh(W h_i^s + b)^\top u_t\right)}{\sum_{j=1}^{L} \exp\!\left(\tanh(W h_j^s + b)^\top u_t\right)}, \qquad h_t = \sum_{i=1}^{L} \alpha_{ti} h_i^s,$$

where $W$ is the fully connected matrix and $b$ is the bias vector.
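A hedged sketch of this extraction step: four learned attention queries pool the same sentence sequence into four task-specific vectors (names and shapes are our assumptions).

```python
import tensorflow as tf

class MultiTaskPool(tf.keras.layers.Layer):
    """Four attention queries -> four summaries h_1, h_2, h_3, h_c."""
    def __init__(self, d_model=256, num_tasks=4):
        super().__init__()
        self.proj = tf.keras.layers.Dense(d_model, activation='tanh')
        self.queries = self.add_weight(name='q', shape=(num_tasks, d_model),
                                       initializer='glorot_uniform')

    def call(self, h):                                  # h: (batch, L, d_model)
        scores = tf.einsum('bld,td->btl', self.proj(h), self.queries)
        alpha = tf.nn.softmax(scores, axis=-1)          # per-task weights over sentences
        return tf.einsum('btl,bld->btd', alpha, h)      # (batch, num_tasks, d_model)
```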

3.4. Structured Data Encoder

The deep learning model is like a judge: we train the model by continually feeding it data, just as a judge's professional competence is built by studying different cases. However, most previous work only lets the model "see" the fact description. In practice, a judge would not sentence the defendant based on the fact description alone. In the process of judgment prediction, we sometimes need explicit data to convict and sentence the defendant; for example, the defendant's guilty attitude, whether the defendant is a recidivist, and the amount of money involved directly affect the final judgment. Based on the above facts, we use regular expressions to extract discrete data and continuous data from the external information, as shown in Tables 2 and 3. In order to integrate these data into THME, we design both a discrete data encoder and a continuous data encoder, as shown in Figure 3.
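As an illustration of the extraction step, the sketch below pulls a few structured fields out of raw Chinese text with regular expressions. The patterns are our own illustrative guesses; the paper does not publish its exact expressions, and Tables 2 and 3 list the actual fields.

```python
import re

def extract_structured(defendant_info: str, court_view: str) -> dict:
    """Illustrative patterns only; not the authors' exact regular expressions."""
    data = {}
    m = re.search(r'(\d+(?:\.\d+)?)元', court_view)      # amount of money involved
    data['amount'] = float(m.group(1)) if m else 0.0     # continuous
    m = re.search(r'(\d{1,3})岁', defendant_info)        # defendant's age
    data['age'] = int(m.group(1)) if m else -1           # continuous
    data['recidivism'] = int('累犯' in court_view)       # discrete: recidivist or not
    data['confession'] = int('如实供述' in court_view or
                             '认罪' in court_view)       # discrete: guilty attitude
    return data
```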

3.4.1. Continuous Data Encoder

We normalize each category of continuous data as

$$\hat{g}_i = \frac{g_i - \mu_i}{\sigma_i},$$

where $\mu_i$ is the mean of the $i$-th category of continuous data and $\sigma_i^2$ is its variance. We thus obtain the continuous data sequence $G = \{\hat{g}_1, \hat{g}_2, \ldots, \hat{g}_{n_g}\}$, where $n_g$ is the number of types of continuous data. Then, we employ a fully connected network to fuse the different types of continuous data and obtain the continuous data vector $v_g$ as

$$v_g = \mathrm{ReLU}(W_g G + b_g),$$

where $W_g$ is the fully connected matrix, $b_g$ is the bias vector, and $\mathrm{ReLU}(\cdot)$ is the activation function.
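A minimal sketch of the continuous data encoder under the settings of Section 4.3 (output dimension 64); we assume the per-field statistics are computed on the training set.

```python
import tensorflow as tf

class ContinuousEncoder(tf.keras.layers.Layer):
    """z-score each continuous field, then fuse with one fully connected layer."""
    def __init__(self, mean, std, d_out=64):
        super().__init__()
        self.mean = tf.constant(mean, tf.float32)    # per-field training-set means
        self.std = tf.constant(std, tf.float32)      # per-field standard deviations
        self.fc = tf.keras.layers.Dense(d_out, activation='relu')

    def call(self, g):                               # g: (batch, n_g) raw values
        return self.fc((g - self.mean) / self.std)   # v_g: (batch, 64)
```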

3.4.2. Discrete Data Encoder

Since there are few discrete data categories, we use the word embedding method to create a discrete data vector space for each category of discrete data. We convert each category of discrete data into its word embedding $e_i$. Similarly, we obtain the discrete data vector $v_i$ as

$$v_i = \mathrm{ReLU}(W_d e_i + b_d),$$

where $W_d$ is the fully connected matrix, $b_d$ is the bias vector, and $\mathrm{ReLU}(\cdot)$ is the activation function. The discrete data sequence is then represented as

$$D = [v_1;\, v_2;\, \ldots;\, v_{n_d}],$$

where $n_d$ is the number of categories of discrete data.
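A matching sketch of the discrete data encoder (embedding dimension 32 per Section 4.3; the projection size and field cardinalities are illustrative assumptions).

```python
import tensorflow as tf

class DiscreteEncoder(tf.keras.layers.Layer):
    """Per-category embedding -> shared projection -> concatenation D."""
    def __init__(self, cardinalities, d_emb=32, d_proj=16):
        super().__init__()
        self.tables = [tf.keras.layers.Embedding(c, d_emb) for c in cardinalities]
        self.fc = tf.keras.layers.Dense(d_proj, activation='relu')

    def call(self, q):                                    # q: (batch, n_d) category ids
        vecs = [self.fc(t(q[:, i])) for i, t in enumerate(self.tables)]
        return tf.concat(vecs, axis=-1)                   # D: (batch, n_d * d_proj)
```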

3.4.3. Case’s Conclusion Prediction

The specific content of the case’s conclusion is presented in Table 4.

In order to predict the case's conclusion, we first obtain the combination $d$ of the discrete data sequence $D$ and the continuous data vector $v_g$, given by

$$d = [D;\, v_g].$$

The case's conclusion is very helpful for LJP, especially for the prediction of the terms of penalty. For the prediction of the case's conclusion, the input is the concatenation of the case's conclusion's corresponding hidden state $h_c$ and $d$. Similarly, we obtain the vector of the case's conclusion $c$ as

$$c = \mathrm{softmax}(W_c [h_c;\, d] + b_c),$$

where $W_c$ is the fully connected matrix, $b_c$ is the bias vector, and $\mathrm{softmax}(\cdot)$ is the activation function. Finally, we obtain the output of the structured data encoder as

$$e = [c;\, d].$$
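The two-stage use of the structured data can be sketched as follows: predict the conclusion $c$ from $[h_c; d]$, then emit $e = [c; d]$ for the downstream classifier (a hedged sketch; layer sizes are illustrative).

```python
import tensorflow as tf

def structured_output(h_c, d, num_conclusions):
    """Predict the case's conclusion c from [h_c; d], then return e = [c; d]."""
    x = tf.concat([h_c, d], axis=-1)                      # (batch, d_hc + d_d)
    c = tf.keras.layers.Dense(num_conclusions,
                              activation='softmax')(x)    # case's conclusion c
    return tf.concat([c, d], axis=-1)                     # e = [c; d]
```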

3.5. Classifier

When a judge decides a case, he/she often first searches for the legal basis related to the case according to the fact description. Then, according to the relevant laws, the conviction is made. Finally, integrating all the evidence and facts, the judge passes the sentence. Therefore, there are topological dependencies among the multitask results [5]. We evaluate the performance on three LJP subtasks: law articles (denoted as $t_1$), charges (denoted as $t_2$), and terms of penalty (denoted as $t_3$). Note that we implement the classifier with the dependencies in Figure 2, i.e.,

$$D_{t_1} = \emptyset, \qquad D_{t_2} = \{t_1\}, \qquad D_{t_3} = \{t_1, t_2\},$$

where $D_{t_i}$ represents the set of subtasks whose results are inputs of $t_i$ and $\emptyset$ is the empty set. This means that the charge prediction depends on the law articles, and the terms of penalty prediction depends on both the law articles and the charges. Such explicit dependencies conform to the judicial logic of human judges, which will be verified in later sections. In order to combine the fact description and the structured data, we concatenate the structured data vector $e$ and the $t$-th subtask's corresponding vector $h_t$ to obtain the vector $\tilde{h}_t$ as

$$\tilde{h}_t = [h_t;\, e].$$

Considering the topological dependencies between subtasks, we predict the law article first, then the charge, and finally the terms of penalty. We obtain the law article's prediction vector as

$$\hat{y}_1 = \mathrm{softmax}(W_1 \tilde{h}_1 + b_1).$$

The processes of charge prediction and terms of penalty prediction are similar to the law article prediction. Different from the law article prediction, the input of the charge prediction is the concatenation of $\tilde{h}_2$ and $\hat{y}_1$, while the input of the terms of penalty prediction is the concatenation of $\tilde{h}_3$, $\hat{y}_1$, and $\hat{y}_2$. Finally, we obtain $\hat{y}_1 \in \mathbb{R}^{k_1}$, $\hat{y}_2 \in \mathbb{R}^{k_2}$, and $\hat{y}_3 \in \mathbb{R}^{k_3}$, where $k_1$, $k_2$, and $k_3$ are the numbers of label categories for subtasks 1, 2, and 3, respectively. In order to learn the parameters of the THME model, we use the Adam algorithm [23]. We adopt the cross-entropy loss in the training process as follows:

$$\mathcal{L}_1 = -\sum_{j=1}^{k_1} y_{1,j} \log \hat{y}_{1,j},$$

where $\hat{y}_{1,j}$ is the prediction result, $y_{1,j}$ is the real result, the subscript 1 denotes the law article prediction, and $j$ indexes the label categories. The above equation represents the loss function of one sample in the prediction of the law articles. When there are multiple samples, we add all the losses together to form the total loss of the law article task. We have three subtasks, so the sum of the losses of the three subtasks constitutes the final loss of the model. We train our model in an end-to-end fashion and utilize dropout [24] to prevent overfitting.
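The topological classifier and the summed loss can be sketched as below; this is our reading of the equations above, with hedged head widths and concatenation order.

```python
import tensorflow as tf

def topological_heads(h, e, k1, k2, k3):
    """h: (batch, 3, d) task vectors h_1..h_3; e: structured encoder output."""
    t1 = tf.concat([h[:, 0], e], axis=-1)
    y1 = tf.keras.layers.Dense(k1, activation='softmax')(t1)   # law articles
    t2 = tf.concat([h[:, 1], e, y1], axis=-1)                  # depends on y1
    y2 = tf.keras.layers.Dense(k2, activation='softmax')(t2)   # charges
    t3 = tf.concat([h[:, 2], e, y1, y2], axis=-1)              # depends on y1, y2
    y3 = tf.keras.layers.Dense(k3, activation='softmax')(t3)   # terms of penalty
    return y1, y2, y3

def total_loss(y_true, y_pred):
    """Sum of the three per-task cross-entropy losses."""
    return sum(tf.keras.losses.categorical_crossentropy(t, p)
               for t, p in zip(y_true, y_pred))
```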

4. Experiments

In this section, we verify the effectiveness of our proposed model. We first introduce the dataset and the data processing. Then, we provide the necessary parameters of our model. Finally, we conduct experiments to verify the advantages of our model and the importance of the external information.

4.1. Dataset Construction

Since there are no publicly available LJP datasets from previous works, we collect and construct an LJP dataset, CJO. CJO consists of criminal cases published by the Chinese government on China Judgment Online. The data used in this experiment all come from the judgment documents published by the Supreme People's Court of China. Before the formal data processing, we first clean the data. Our experiment targets criminal offenses, so judgment documents of other types are screened out. Then, we filter out the multi-criminal judgment documents; their structure is complicated, and we will study them in future work. The terms of penalty for a single-criminal judgment document are at most 25 years, so we also screen out the judgment documents with terms of penalty of more than 25 years (except the death penalty and life imprisonment). Finally, we screened 5,480,000 judgment documents and obtained 750,000 available data pieces, which are used in the experiments.

Our model's inputs include the token sequence $X$, the discrete data, and the continuous data. However, we find that this processing approach is not suitable for the terms of penalty, because it cannot handle their uneven distribution. Therefore, we discretize the terms of penalty; the specific method is shown in Table 5.

For the majority of data in the CJO dataset, the terms of penalty are no longer than 12 months. Meanwhile, the amount of data decreases as the terms of penalty increase; especially for terms of penalty longer than 3 years, the amount of data drops significantly. In order to handle this uneven distribution, we use small intervals where the data are dense and large intervals where the data are sparse, so as to ensure the stability of the amount of data in each interval.
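For illustration, discretization of this kind can be implemented with a sorted list of bin edges. The edges below are hypothetical placeholders; the actual intervals are those given in Table 5.

```python
import bisect

# Hypothetical bin edges in months: dense bins under 12 months,
# wider bins toward the 25-year cap (placeholders for Table 5).
EDGES = [6, 9, 12, 24, 36, 60, 120, 300]

def penalty_to_class(months: int) -> int:
    """Map a term of penalty in months to a class id (0..len(EDGES))."""
    return bisect.bisect_right(EDGES, months)
```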

4.2. Baselines

To evaluate the performance of our proposed THME framework, we employ the following text classification models and judgment prediction methods as baselines:

(i) Fact-Law Attention Model [4]: proposed by Luo et al. in 2017. The main idea is to embed the law articles into the model and then use the fact description to retrieve the relevant law articles to help the model achieve good results.

(ii) TOPJUDGE [5]: proposed by Zhong et al. in 2018. The main idea is to use the topological dependencies between subtasks to improve task performance.

(iii) MPBFN-WCA [6]: proposed by Yang et al. in 2019. The main idea is that repeated iterations between subtasks can reduce error accumulation, thereby improving the effectiveness of the tasks.

4.3. Experimental Settings

We set the word embedding size as 256. For the discrete data encoder, the dimension of the discrete data embedding is 32. The dimension of the output vector of the discrete data encoder is 64, the dimension of the output vector of the continuous data encoder is 64, and the dimension of the case’s conclusion’s vector is 256.

We use the TensorFlow framework to build the neural networks. In the training part, we set the learning rate of the Adam optimizer to 0.0001 and the dropout probability to 0.5. The padding length of the text is 320 tokens: each text is divided into 20 sentences, and the length of each sentence is 16 tokens. We set the batch size to 256 for all models. We train each model for up to 256 epochs, and if overfitting occurs, we terminate training early.
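A small sketch of the corresponding input shaping, padding or truncating each document to the 20-sentence × 16-token grid (pad id and function name are our assumptions):

```python
def to_sentence_grid(token_ids, num_sents=20, sent_len=16, pad_id=0):
    """Pad/truncate a flat token-id list to a num_sents x sent_len grid."""
    flat = token_ids[:num_sents * sent_len]                    # truncate to 320
    flat = flat + [pad_id] * (num_sents * sent_len - len(flat))  # right-pad
    return [flat[i * sent_len:(i + 1) * sent_len] for i in range(num_sents)]
```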

We employ accuracy (Acc.), macro-precision (MP), macro-recall (MR), and macro-F1 as evaluation metrics. Here, the macro-precision/recall/F1 is calculated by averaging the precision/recall/F1 of each category.
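For reference, these metrics can be computed with scikit-learn, where average='macro' averages over categories exactly as described (an illustrative sketch, not the authors' evaluation code):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Return Acc., MP, MR, and macro-F1 for one subtask."""
    acc = accuracy_score(y_true, y_pred)
    mp, mr, mf1, _ = precision_recall_fscore_support(
        y_true, y_pred, average='macro', zero_division=0)
    return acc, mp, mr, mf1
```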

4.4. Results and Analysis

All the models are run 3 times, and we evaluate the performance on the three LJP subtasks (law articles, charges, and terms of penalty), reporting the average values as the final results for clear illustration. Experimental results on the test set of CJO are shown in Table 6. THME achieves the best performance on all metrics, which verifies the effectiveness and robustness of our proposed framework. Compared with TOPJUDGE and MPBFN-WCA, THME takes advantage of the information based on fact determination and thus achieves promising improvements. This indicates that the external information enables the model to learn rules that are not in the original fact description. Compared with the Fact-Law Attention Model, our model takes advantage of the correlations among relevant subtasks and achieves significant improvements. Thus, it is important to properly model the topological dependencies between the different subtasks.

4.5. Ablation Study

To further illustrate the significance of the modules in our framework, we compare THME with the following variants:

(i) Transformer-HAN-Single-Extra (THSE): We decompose the multitask model into a single-task model to verify the superiority of the multitask model.

(ii) Transformer-HAN-Single (THS): In order to reflect the role of the continuous and discrete data based on the fact description in a single task, we design THS to compare with THSE.

(iii) Transformer-HAN-Multi (THM): In order to reflect the role of the continuous and discrete data based on the fact description in multitasking, we design THM to compare with THME.

(iv) GRU-HAN-Multi-Extra (GHME): In order to prove the role of the Transformer in the model, we design GHME and compare it with THME.

As shown in Table 7, compared with THS, THM improves the performance of law article prediction, charge prediction, and terms of penalty prediction on our dataset. Thus, the multitask model is beneficial to each individual task. THSE performs better than THS, especially on terms of penalty prediction. Thus, the structured data based on the fact determination plays an important role: even the single-task model with structured data (THSE) is significantly better than the multitask model without it (THM). Hence, the structured data plays a more important role than the multitask structure.

Comparing GHME and THS, we can see that THS performs better, which indicates that the Transformer handles long documents better than the traditional GRU and that the contribution of Transformer-HAN to LJP is greater than that of the multitask topological structure and the external information combined. This also shows that the proposed Transformer-HAN is a strong model for dealing with long-term dependency problems.

4.6. Information Source Study

To further show the significance of the external information and explore the impacts of the information source, we evaluate the performance of THME under various information sources. We remove all the external information (fact), court view (-court view), defendant’s information (-defendant’s information), and case’s conclusion (-case’s conclusion), respectively. Results are summarized in Table 8.

It is shown that the performance of THME becomes worse on all tasks after removing any source of information. More specifically, when we remove all the external information, a tremendous decrease is observed for the terms of penalty prediction. This demonstrates that the external information is beneficial for terms of penalty prediction. When we remove the defendant's information, the performance is better than when we remove the court view, which demonstrates that the court view is more significant than the defendant's information and plays a decisive role in LJP. The case's conclusion comes from the court view. When we remove the case's conclusion, the performance of THME is worse than when removing the defendant's information and similar to when removing the court view. This demonstrates that the case's conclusion plays a very important role in LJP.

4.7. Error Analysis and Solution

Prediction errors induced by our proposed model can be traced to the following causes.

4.7.1. Data Imbalance

Data imbalance is a natural phenomenon, because the number of cases with long terms of penalty is significantly smaller than that of cases with short terms of penalty. Although we have adopted effective techniques to discretize the terms of penalty to reduce the impact of data imbalance, for the subtasks of law articles and charges, our model achieves a much higher accuracy than macro-F1. This issue is even more severe on the subtask of the terms of penalty, for which our model yields a poor macro-F1. The bad performance is mainly due to the imbalance of category labels; e.g., there are only a few training instances where the term is "life imprisonment or death penalty." Most judgment prediction approaches perform poorly (especially on recall) on these labels, as shown in Figure 4.

4.7.2. Terms of Penalty Problem

It can be seen from the results that although our model surpasses the other models in terms of penalty prediction, the performance on this task is still very poor: both the accuracy and the macro-average index remain low, far from meeting actual needs. Real cases are often multiple-criminal cases, which are much more complicated than the cases we analyze, but complex cases often contain more information, which also provides ideas for improving terms of penalty prediction. In multiple-criminal cases, we can split the case into multiple subcases and then comprehensively consider the categories, number, and severity of the subcases to provide more information for terms of penalty prediction. The specific implementation remains to be explored.

5. Conclusion

In this paper, we have studied multi-extra, multitask LJP with topological dependencies between subtasks and addressed the problems of insufficient information and insufficient encoding in LJP. Based on the topological structure between the multiple tasks, we extract information from the fact description via the Transformer-HAN encoder, extract the external information from the judgment document via the structured data encoder, and then integrate them in the classifier to reduce the misjudgment of penalty prediction. Experimental results show that our model achieves significant improvements over the baselines on all judgment prediction tasks.

In the future, we will seek to explore the following directions: (1) It is interesting to explore multitask legal prediction with multiple labels and multiple defendants; in recent years, the rise of knowledge graphs and graph neural networks (GNN) has made this possible [25–28]. (2) We will explore how to incorporate various factors into LJP, such as the defendant's subjective viciousness, criminal means, and identity, which are not considered in this work. (3) When a judge decides a case, similar cases are crucial to the judgment result, so we can also recommend similar judgment documents to judges [29–31]. (4) With more and more research on transfer learning, GPT, BERT, and other pretrained language models have emerged and continuously improve the ability to extract information from text. Using transfer learning when processing the fact descriptions may improve the effectiveness of the models [32–34].

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Kongfan Zhu and Rundong Guo contributed equally to the paper.

Acknowledgments

This work was supported in part by the Key Research and Development Program of China under Grant no. 2018YFC0831000 and no. 2017YFC0803400.