Abstract

Molecular property prediction is an essential task in drug discovery. Most computational approaches based on deep learning focus either on designing novel molecular representations or on combining them with advanced models. However, researchers have paid less attention to the potential benefits of massive unlabeled molecular data (e.g., ZINC). The task remains challenging owing to the limited scale of labeled data. Motivated by recent advances in pretrained models for natural language processing, and observing that drug molecules can, to some extent, be naturally viewed as a language, we investigate how to adapt the pretrained model BERT to extract useful molecular substructure information for molecular property prediction. We present a novel end-to-end deep learning framework, named Mol-BERT, that combines an effective molecular representation with a pretrained BERT model tailored for molecular property prediction. Specifically, a large-scale BERT model is pretrained on four million unlabeled drug SMILES (from ZINC 15 and ChEMBL 27) to generate embeddings of molecular substructures. The pretrained BERT model can then be fine-tuned on various molecular property prediction tasks. To examine the performance of our proposed Mol-BERT, we conduct experiments on 4 widely used molecular benchmark datasets. Compared with traditional and state-of-the-art baselines, the results show that Mol-BERT outperforms current sequence-based methods and achieves at least a 2% improvement in ROC-AUC score on the Tox21, SIDER, and ClinTox datasets.

1. Introduction

Effectively identifying molecular properties (e.g., bioactivity and toxicity) plays an essential part in drug discovery and material science, as it can alleviate the costly and time-consuming process associated with traditional experimental methods [1]. This process is usually known as molecular property prediction, and it is a fundamental task for exploring the functionality of new drugs. A typical molecular property prediction system takes drug features or descriptors as input and outputs predicted values for predefined chemical properties. The predicted values can benefit various subsequent tasks, including virtual screening [2–4] and drug repurposing [5–7]. However, accurately predicting molecular properties with computational methods remains challenging.

Previous machine learning approaches focused on manually designing a variety of expert-engineered descriptors or molecular fingerprints based on experimental statistics to predict molecular properties [8–10]. For example, the extended-connectivity fingerprint (ECFP) [11], the most representative fingerprint method, was designed to generate different types of circular fingerprints that capture the molecular structures of atom neighborhoods by using a fixed hash function [12]. The obtained fingerprint representations are then fed to traditional machine learning models to perform predictions, and this scheme can be applied to a wide range of models, such as logistic regression, support vector classification, kernel ridge regression, random forest, influence relevance voting, and multitask networks [13]. However, this line of research depends heavily on hand-crafted features and domain knowledge. Besides, the generated hashed bit vectors make it difficult to biologically interpret the relationship between chemical properties and molecular structures.

Inspired by the remarkable achievements of deep learning in a variety of domains, including computer vision [14] and natural language processing [15, 16], it has also gained much attention for molecular property prediction. The molecular representation methods introduced so far can be broadly divided into two categories: sequence-based and graph-based approaches. For sequence-based methods, the simplified molecular input line entry specification, abbreviated as SMILES, is the most common molecular linear notation that encodes the molecular topology on the basis of chemical rules [17]. Several methods take the SMILES representation as input and use successful sequence models (e.g., recurrent neural networks) to obtain molecular representations [18], but this line of work suffers from insufficient labeled data for specific molecular tasks. More recently, researchers have adopted the unsupervised pretraining strategies of natural language processing (NLP) to learn contextual information from large unlabeled molecular datasets. For example, an unsupervised machine learning method named Mol2vec was developed to learn vector representations of molecular substructures [19], and SMILES-BERT was proposed to pretrain a model through a masked SMILES recovery task using attention-based transformer layers [20]. These pretrained methods pay more attention to the contextual information of molecular sequences, but they hardly consider the molecular substructures (e.g., functional groups) that essentially contribute to molecular properties [21, 22].

On the other hand, graph neural networks (GNNs) have been adopted to explore graph-based representations for molecular property prediction [23–25]. Graph convolutions were the first work to apply convolutional layers to encode a molecular graph into neural fingerprints [26]. Similarly, much effort has been made to extend a variety of GNNs to property prediction tasks. For example, the weave featurization encoded chemical features to form molecule-level representations [27], and some methods extended the graph attention network [28] to learn aggregation weights [25, 29]. Moreover, to better encode the interactions between atoms, a message passing neural network named MPNN was designed to utilize the attributed features of both atoms and edges [30]. More recently, DMPNN [31] and CMPNN [32] were introduced to further leverage the attributed information of nodes and edges during message passing. Although graph-based models have achieved strong performance on molecular graph representation, they seldom make use of the vast available biological sequence data.

Recently, substantial pretrained models [33–37] trained on large corpora or unlabeled data have been shown to learn universal representations that benefit various downstream tasks, including protein sequence representation [38, 39], biomedical text mining [40, 41], and chemical reaction prediction [42]. Advances in pretrained models have demonstrated their power for extracting information from unlabeled sequences, which raises a tantalizing question: can we develop a pretrained model to extract useful molecular substructure information from massive SMILES sequence datasets? To address this question, we propose a novel neural framework, named Mol-BERT, tailored for molecular property prediction. The idea of Mol-BERT is natural and intuitive. Our framework consists of three modules. The feature extractor first extracts atom-level and substructure features centered on each atom; this module can be replaced with a wide range of molecular representation methods. The pretrained BERT module then learns molecular substructure or fragment information from a large pretraining corpus (i.e., unlabeled SMILES sequences). The final module predicts the specific molecular property after fine-tuning the pretrained Mol-BERT via a multityped classifier. To illustrate the performance of our proposed method on various prediction tasks, Mol-BERT is fine-tuned and evaluated on 4 widely used molecular benchmark datasets. In comparison with state-of-the-art baselines (i.e., sequence- and graph-based methods), the experimental results demonstrate the effectiveness of our proposed Mol-BERT.

This paper is organized as follows. Section 2 introduces the preprocessed corpus for Mol-BERT pretraining and the molecular benchmark datasets used in this work. Section 3 then presents the molecular representation method and the pretraining and fine-tuning of the Mol-BERT model. Section 4 analyzes the prediction performance of our proposed method on several molecular datasets and compares it with state-of-the-art sequence-based and graph-based approaches. Finally, Section 5 concludes this work.

2. Materials

The corpus of chemical compounds (i.e., unlabeled SMILES) was obtained from the publicly available ZINC and ChEMBL databases. As a free database for virtual screening, ZINC contains over 230 million purchasable compounds in multiple formats, including ready-to-dock and 3D structures [43]. ChEMBL is a manually curated database of bioactive molecules with drug-like properties, which collects 1,961,462 distinct compounds [44]. Specifically, we selected compound SMILES from ZINC version 15 and ChEMBL version 27 and filtered them following the same criteria as Mol2vec [19]. The two databases were first merged and duplicates were removed. Then, only compound SMILES that could be processed by RDKit software [45] were kept, and they were filtered according to the following cutoffs and criteria: molecular weight between 12 and 600; heavy-atom count between 3 and 50; clogP between −5 and 7; and only H, B, C, N, O, F, P, S, Cl, and Br atoms allowed. Additionally, all counterions and solvents were removed, and canonical SMILES representations were generated by RDKit. This procedure yielded 4 million compounds. Detailed information on the pretraining corpus is provided in the Supplementary Materials (available here).
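For reference, the following is a minimal sketch, assuming RDKit's standard descriptor functions (Descriptors.MolWt, Crippen.MolLogP) and the cutoffs listed above, of how such a filtering step could be implemented; it is an illustration rather than the authors' exact preprocessing script.

```python
from typing import Optional
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors

ALLOWED_ATOMS = {"H", "B", "C", "N", "O", "F", "P", "S", "Cl", "Br"}

def keep_compound(smiles: str) -> Optional[str]:
    """Return the canonical SMILES if the compound passes all cutoffs, else None."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                                    # unparsable by RDKit
        return None
    if not 12 <= Descriptors.MolWt(mol) <= 600:        # molecular weight cutoff
        return None
    if not 3 <= mol.GetNumHeavyAtoms() <= 50:          # heavy-atom count cutoff
        return None
    if not -5 <= Crippen.MolLogP(mol) <= 7:            # clogP cutoff (range assumed from Mol2vec)
        return None
    if any(atom.GetSymbol() not in ALLOWED_ATOMS for atom in mol.GetAtoms()):
        return None                                    # disallowed element present
    return Chem.MolToSmiles(mol, canonical=True)       # canonical SMILES via RDKit
```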

In this paper, we selected 4 widely used benchmark datasets from MoleculeNet [13] to evaluate the performance of our proposed method. SMILES strings are used to encode the input chemical compounds in all benchmark datasets. The benchmark datasets are introduced as follows:
(i) BBBP. The BBBP dataset provides 2,053 compounds with measurements of their blood-brain barrier permeability, used to predict barrier penetration.
(ii) Tox21. The Tox21 dataset measures 8,014 compounds with their corresponding toxicity data against 12 targets. Toxicity is recorded as a binary task: a label of 1 means the compound is toxic against the specific target, and 0 otherwise.
(iii) SIDER. The SIDER dataset contains a total of 1,427 compounds and their adverse drug reactions (ADRs) against 27 system organ classes. The ADR outcomes are described as binary labels.
(iv) ClinTox. The ClinTox dataset provides 2 classification tasks for 1,491 drug compounds with known chemical structures: clinical trial toxicity and FDA approval status.

In this paper, we followed the experimental setting of FP2VEC [46] and split each dataset into training, validation, and test sets with a ratio of 8/1/1. Table 1 shows the detailed description of the selected benchmark datasets. Note that binary and multilabel correspond to binary and multilabel classification tasks, respectively. The random splitting method randomly assigns samples to the training, validation, and test subsets, whereas the scaffold splitting method splits the samples on the basis of their 2D structural frameworks, as implemented in RDKit.
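As an illustration of the scaffold splitting procedure, the sketch below groups molecules by their Bemis-Murcko scaffolds with RDKit and assigns whole scaffold groups to the three subsets at roughly an 8/1/1 ratio. This follows the common MoleculeNet-style recipe and is an assumption rather than the exact FP2VEC implementation.

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, frac_train=0.8, frac_valid=0.1):
    """Group molecules by Bemis-Murcko scaffold and split groups 8/1/1 (valid SMILES assumed)."""
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi, includeChirality=False)
        groups[scaffold].append(idx)
    # Largest scaffold groups first, so the biggest groups land in the training set.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(smiles_list)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```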

3. Methods

In this section, we first give an overview of our proposed Mol-BERT; we then separately introduce its three modules: the feature extractor, the pretraining of Mol-BERT, and the fine-tuning of Mol-BERT.

3.1. Overview

Figure 1 illustrates the overall process of Mol-BERT. As shown in Figure 1, Mol-BERT consists of three modules: the feature extractor, the pretraining of Mol-BERT, and the fine-tuning of Mol-BERT. The framework learns to predict molecular properties as follows. Given the input drug data (i.e., canonical SMILES), the feature extractor module applies the molecular representation to transform each compound into a set of atom identifiers (see Feature Extractor). The outputs are then fed into a BERT module, which is pretrained on the vast preprocessed corpus to obtain a contextual embedding of each molecular substructure (see Pretraining Mol-BERT). Finally, the fine-tuned Mol-BERT outputs a value indicating the probability of a certain molecular property in the classification task (see Fine-Tuning Mol-BERT).

3.2. Feature Extractor

Molecular substructures are an important cue for molecular interactions [21, 22]. Therefore, the key idea behind Mol-BERT is to obtain a better representation of molecular substructures by pretraining BERT on vast unlabeled SMILES sequences. Inspired by Mol2vec [19], which treats molecular substructures or fragments derived from the Morgan algorithm as “words” and compounds as “sentences,” we adopt a similar method to decompose the input SMILES sequences into biological words and sentences.

To achieve this, given an input compound SMILES string, we first obtain its standardized canonical SMILES representation generated by RDKit. Then, the Morgan algorithm [11] is used to generate all atom identifiers at radius 0 and radius 1, denoted by a_i^0 and a_i^1, respectively, where the subscript i represents the index of each atom and the superscript denotes the radius. As illustrated in the left part of Figure 1, a_i^0 (i.e., the green node) represents the current atom traversed in atom order, while a_i^1 (i.e., the Kelly-green nodes) represents the set of neighboring atoms directly connected to the current atom, so a_i^1 can be viewed as a kind of substructure or fragment. These identifiers are then hashed into a fixed-length vector. Take CC(N)C(=O)O as an example: it consists of six heavy atoms, so we obtain its atom identifiers (i.e., a_1^0–a_6^0) and the corresponding substructures (i.e., a_1^1–a_6^1), which are then hashed into fixed-length values (e.g., one substructure identifier corresponds to 3537119591). Finally, all vectors of the Morgan substructures are summed to obtain the molecular representation. In this way, we generate 119 atom identifiers at radius 0 and 13,325 substructure identifiers at radius 1. The feature extractor module in Mol-BERT can be replaced with various molecular representation methods; for example, FP2VEC [46] can be used as the feature extractor to generate the 1024-bit Morgan (or circular) fingerprint with a predefined radius.
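To make the identifier extraction concrete, the sketch below (an assumption, not the authors' released code) enumerates per-atom Morgan identifiers at radii 0 and 1 with RDKit's GetMorganFingerprint and its bitInfo output, in the spirit of the Mol2vec-style “words” described above.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def atom_identifiers(smiles: str, radius: int = 1):
    """Return {atom_index: {radius: Morgan identifier}} for a given SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Unparsable SMILES: {smiles}")
    bit_info = {}
    # Unfolded Morgan fingerprint; bit_info maps identifier -> ((atom_idx, radius), ...)
    AllChem.GetMorganFingerprint(mol, radius, bitInfo=bit_info)
    identifiers = {atom.GetIdx(): {} for atom in mol.GetAtoms()}
    for identifier, environments in bit_info.items():
        for atom_idx, rad in environments:
            identifiers[atom_idx][rad] = identifier
    return identifiers

# Example: alanine, CC(N)C(=O)O
print(atom_identifiers("CC(N)C(=O)O"))
```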

3.3. Pretraining Mol-BERT

As a contextualized word representation model, BERT [33] adopts a masking technique to predict randomly masked words in a sequence, which results in learning bidirectional representations. Accordingly, Mol-BERT uses a masked SMILES task (i.e., over atom identifiers) to predict randomly masked substructures in a SMILES string. Unlike the traditional pretraining of language models in NLP, where BERT was trained on English Wikipedia and BooksCorpus, we pretrain Mol-BERT on our preprocessed corpus obtained from the ZINC version 15 and ChEMBL version 27 databases. Specifically, the input SMILES is transformed into a list of atom identifiers by the previous module, rather than into character-level tokens as in SMILES-BERT [20], and these identifiers are then embedded as the input of the BERT module for pretraining. We initialized Mol-BERT with weights from BERT [33] and followed the same strategy of randomly masking 15% of the tokens in a SMILES (i.e., atom identifiers) with the [MASK] token. The tokens are embedded into feature vectors; here, we use only token embeddings and positional embeddings since only the Masked Language Model (MLM) task is adopted in this paper. Mol-BERT differs from BERT in the following ways: (1) Mol-BERT adopts a single masked SMILES task (i.e., MLM) on large-scale unlabeled datasets, while BERT uses two kinds of self-supervised tasks on English Wikipedia and BooksCorpus, and (2) we exclude the segment embedding adopted in the BERT model since Mol-BERT does not require training on consecutive sentence pairs.
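The following is a minimal sketch of the masked-identifier objective described above (the MASK_ID and IGNORE values are assumptions for illustration, not the released training code): 15% of the identifier tokens in a sequence are replaced by [MASK], and the model is trained to recover the original identifiers at those positions.

```python
import random

MASK_ID = 1    # assumed id of the [MASK] token in the identifier vocabulary
IGNORE = -100  # label value ignored by the cross-entropy loss (PyTorch convention)

def mask_tokens(token_ids, mask_prob=0.15, seed=None):
    """Return (masked_input, labels) for a single atom-identifier sequence."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            masked.append(MASK_ID)   # hide the substructure identifier
            labels.append(tok)       # model must predict the original identifier
        else:
            masked.append(tok)
            labels.append(IGNORE)    # no loss on unmasked positions
    return masked, labels
```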

3.4. Fine-Tuning Mol-BERT

After pretraining on the vast corpus of unlabeled SMILES compounds, Mol-BERT can be applied to molecular property prediction on various downstream tasks with minimal modification of hyperparameters. We mostly follow the same architecture, optimization, and hyperparameter choices used in [8]. For classification tasks (e.g., BBBP and Tox21), we feed the final BERT vector into a linear classification layer to predict the molecular property; a simple classifier outputs the binary value, and the labeled samples are used to fine-tune the model. Concretely, Mol-BERT feeds the learned drug embeddings into a multityped MLP classifier to generate predictions. Outputs include both continuous values, such as solubility, and binary outputs indicating whether a molecule is toxic or nontoxic. The multityped classifier detects whether the task is regression or classification and switches to the corresponding loss function and evaluation metrics: for regression, we use the mean square error (MSE) as the loss function and the root mean square error (RMSE) as the performance metric; for classification, we use binary cross-entropy as the loss function and the area under the receiver operating characteristic curve (AUC-ROC) as the performance metric.
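A minimal sketch of such a prediction head is given below, assuming a pooled BERT embedding of size hidden_dim and n_tasks binary or continuous targets; the released model may differ in depth and activation.

```python
import torch
import torch.nn as nn

class PropertyHead(nn.Module):
    """Simple task head: linear layer plus a task-dependent loss."""
    def __init__(self, hidden_dim: int = 768, n_tasks: int = 1, task: str = "classification"):
        super().__init__()
        self.task = task
        self.classifier = nn.Linear(hidden_dim, n_tasks)  # one output per task

    def forward(self, pooled_embedding: torch.Tensor, labels: torch.Tensor = None):
        logits = self.classifier(pooled_embedding)
        if labels is None:
            return logits
        if self.task == "classification":
            # binary cross-entropy on logits for (multi-label) classification
            loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
        else:
            # mean squared error for regression targets
            loss = nn.functional.mse_loss(logits, labels)
        return logits, loss
```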

4. Results and Discussion

In this section, we first introduce the experimental settings. Then, we demonstrate the performance of our proposed Mol-BERT in comparison with state-of-the-art methods for molecular property prediction on 4 widely used benchmark datasets.

4.1. Baseline Methods

We compare Mol-BERT with state-of-the-art sequence-based and graph-based baselines, which can be categorized as follows:
(i) ECFP: extended-connectivity fingerprints (ECFP) [11] are a widely used type of circular (Morgan) fingerprint for encoding the substructures in a molecule.
(ii) GraphCov: graph convolutions were proposed by [26] to apply convolutional networks to learn molecular fingerprints; we refer to this method as GraphCov.
(iii) Weave: similar to GraphCov, the weave featurization [27] encodes meaningful features of atoms, bonds, and graph distances between matching pairs to form molecule-level representations.
(iv) MPNN: a message passing method that operates on undirected graphs [30].
(v) FP2VEC: based on the Morgan (circular) fingerprint, it encodes a molecule as a set of trainable vectors [46].
(vi) SMILES-BERT: [20] proposes a semisupervised BERT model that takes the SMILES representation as input.

We report the results of the baselines ECFP, GraphCov, Weave, and FP2VEC from FP2VEC [46], and we reimplemented MPNN and SMILES-BERT ourselves. MPNN [30] is a graph-based model that considers edge features during message passing, and SMILES-BERT [20] is a sequence-based model that relies entirely on transformer layers and attention mechanisms to encode compound SMILES. These reimplementations are based on the publicly available code, and the model settings were kept the same as reported in the original papers.

4.2. Evaluation Metrics

We applied the area under the receiver operating characteristic curve (AUC-ROC) as the metric for the classification tasks. Following [46], we train the prediction model on the training set and select the model with the best AUC-ROC on the validation set; the prediction results are then measured with the selected model on the test set. For all experiments in this paper, we repeated the same procedure on each task five times and report the mean and standard deviation of the AUC scores. We evaluated all models with the scaffold splitting method, as reported by [46].
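For reference, a small sketch of this reporting step is shown below (an assumed helper, not the authors' evaluation script): it collects the test ROC-AUC of the five repetitions and returns their mean and standard deviation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def summarize_runs(run_results):
    """run_results: list of (y_true, y_score) pairs, one per repetition."""
    aucs = [roc_auc_score(y_true, y_score) for y_true, y_score in run_results]
    return float(np.mean(aucs)), float(np.std(aucs))  # mean and std of test ROC-AUC
```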

4.3. Implementation Details

To optimize all trainable parameters, we adopt the Adam optimizer for both pretraining and fine-tuning. A dynamic learning rate schedule is used to adjust the learning rate during pretraining and fine-tuning according to the downstream task. We implement Mol-BERT in PyTorch, pretrain it on 3 NVIDIA GTX 1080Ti GPUs, and run all fine-tuning tasks on a single NVIDIA RTX 2080Ti GPU. Table 2 lists all the hyperparameters of the fine-tuning model.
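The paper only states that Adam with a dynamic learning rate is used; the sketch below shows one common way to set this up in PyTorch (a linear warmup/decay schedule with assumed hyperparameter values), purely as an illustration.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model, lr=1e-4, warmup_steps=1000, total_steps=100000):
    """Adam plus a linear warmup/decay schedule (hyperparameter values are assumptions)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    def lr_lambda(step):
        if step < warmup_steps:                       # linear warmup phase
            return step / max(1, warmup_steps)
        # linear decay to zero after warmup
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    scheduler = LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```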

4.4. Comparison Results

To examine the competitiveness of the proposed model, we compared Mol-BERT with state-of-the-art models for molecular property prediction on the classification tasks. Table 3 reports the mean and standard deviation of the ROC-AUC scores on the BBBP, SIDER, Tox21, and ClinTox datasets. From this table, we observe that the proposed Mol-BERT significantly outperforms the baselines on three datasets: Tox21, SIDER, and ClinTox. More specifically, Mol-BERT achieved a ROC-AUC at least 2.9% higher on Tox21, 2.2% higher on SIDER, and 4.4% higher on ClinTox than the baselines. For example, on the Tox21 dataset, Mol-BERT achieved a ROC-AUC score of 0.839, a 2.9% absolute gain over ECFP (the second best method). This is because Mol-BERT leverages a molecular representation pretrained on large-scale unlabeled SMILES sequences, whereas ECFP relies heavily on feature engineering. Compared with the graph-based methods that exploit molecular graph features, Mol-BERT outperformed them on three datasets and achieved comparable performance with MPNN on the BBBP dataset, which indicates that the contextual information learned from large unlabeled datasets substantially benefits model performance. Moreover, in comparison with the sequence-based pretrained model SMILES-BERT, our proposed Mol-BERT achieved stable performance across all datasets. This is a very encouraging result; the reason could be that our method adopts a molecular representation that captures the structural features of molecular substructures, which benefits performance. Overall, this is a nontrivial achievement in molecular property prediction.

5. Conclusions

In this paper, we proposed an effective molecular representation method combined with the pretrained BERT model, named Mol-BERT, to address molecular property prediction. Mol-BERT leverages a representation of molecular substructures pretrained on a large-scale unlabeled SMILES dataset, which enables it to learn both the structural and the contextual information of drugs. We implemented the proposed method and conducted experimental comparisons on four widely used benchmarks. The experimental results show that Mol-BERT outperforms the classic and state-of-the-art graph-based models on molecular property prediction.

While our proposed method achieves good performance on classification tasks, some limitations remain. First, our method achieves relatively poorer performance on regression tasks, mainly owing to the small number of samples in those datasets (e.g., FreeSolv); we would like to investigate metalearning strategies for data augmentation, which have shown great success in natural language processing. Second, molecular property prediction is only the first step in drug discovery; we will continue to improve our method and investigate subsequent prediction tasks (e.g., protein-protein interaction and drug-disease association) in the future.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request (https://github.com/cxfjiang/MolBERT).

Conflicts of Interest

The authors declare no competing financial interest.

Supplementary Materials

The pretraining corpus is available at https://drive.google.com/drive/folders/1ST0WD1-hX9XtiPWwCceZbgZlBV0fKPbe. (Supplementary Materials)