Abstract

Early diagnosis of pandemic diseases such as COVID-19 can prove beneficial in dealing with difficult situations and helping radiologists and other experts manage staffing more effectively. The application of deep learning techniques to genetics, microscopy, and drug discovery has created a global impact. It can enhance and speed up medical research and the development of vaccines, which is required for pandemics such as COVID-19. However, current drugs such as remdesivir and clinical trials of other chemical compounds have not shown impressive results. Therefore, it can take more time to provide effective treatment or drugs. In this paper, a deep learning approach based on logistic regression, SVM, Random Forest, and QSAR modeling is suggested. QSAR modeling is done to find the drug targets with protein interaction, along with the calculation of binding affinities. Deep learning models are then trained on the molecular descriptor dataset for the robust discovery of drugs and feature extraction for combating COVID-19. Results show strong binding affinities (more negative than −18) for many molecules that could be used to block the multiplication of SARS-CoV-2, the virus responsible for COVID-19.

1. Introduction

The first case of COVID-19 was detected in December 2019, and since then, it has grown rapidly, affecting millions of people around the globe. More than 2 million cases have been confirmed, with over 0.15 million deaths globally [1, 2]. Drug repurposing is defined as discovering and identifying newer applications for existing drugs in the treatment of various diseases [3]. Recent advancements in drug discovery using deep learning have made it possible to speed up the identification and development of new pharmaceuticals [4]. Various drugs, such as Arbidol, remdesivir, and favipiravir, have been tested to cure COVID-19 patients, and many others are in the testing phase [4]. Biomedical researchers are investigating drugs for treating patients while attempting to develop a vaccine for preventing infection [5]. On the other hand, computer scientists have developed early detection models for COVID-19 from CT scans and X-ray images [5]. These techniques are a subset of deep learning and have been applied successfully in various fields [5]. Over the past few years, a significant increase in the quantity of biomedical data has resulted in the emergence of new technologies such as parallel synthesis and HTS (high-throughput screening) for mining large-scale chemical data [6]. Since COVID-19 is transmitted from person to person, electronic devices based on artificial intelligence may play a crucial role in preventing the spread of this virus. With the expansion of the role of health epidemiologists, the pervasiveness of electronic health data has also increased [7]. The increasing availability of electronic health data provides a massive opportunity to enhance healthcare, both for discovery and for practical applications [7]. These data can be used to train machine learning algorithms and improve their decision-making in disease prediction [7].

As the number of coronavirus infections rapidly outgrew the medical services available in hospitals, a significant burden was imposed on healthcare systems [7]. Because of the limited supply of hospital services and the delay in diagnostic test results, it is difficult for health professionals to provide patients with sufficient medical care. Moreover, since the number of cases requiring testing for coronavirus grows day by day, testing every case is not feasible due to time and cost factors [7]. This paper suggests a deep learning-based technique that would be helpful in rapidly finding drugs for combating the pandemic. Deep learning is a quickly emerging and constantly expanding area. It optimizes computer programs using data: the parameters are learned from training data or previous encounters, and the resulting models can also make predictions on unseen data. Deep learning further lets us exploit the statistics of the data to construct a mathematical model. Its main advantage is that it learns from the provided data (experience) without human intervention, searching for trends and patterns in the data to produce the desired output [8]. Deep learning techniques have achieved high efficiency in various tasks, including drug development, prediction of properties, and drug target forecasting. As drug development is a complex task, the deep learning approach makes this process faster and cheaper.

The challenges with COVID-19 at present make it necessary to look for alternatives in medicine or drugs to combat the rise of cases due to COVID-19 infection. One of the significant challenges is the processing delay in finalizing the drugs for vaccine formulation. Many pharmaceutical companies have achieved success to some extent after passing through different trials. Hence, predicting the most probable drug candidates can speed up vaccine formulation and thus save many human lives. Another challenge is that most of the testing for vaccine formulation is done on a clinical basis, where all drug combinations are tried to obtain the desired selection of drugs. Still, computational techniques are underutilized for this purpose at present. Thus, it is the need of the hour to look for alternatives based on machine intelligence techniques that can provide more accurate solutions at a faster pace.

Based on the above challenges, the main contributions of the paper are as follows:
(1) A deep learning approach based on logistic regression, SVM, and Random Forest, along with QSAR modeling, is proposed to discover drugs for the treatment of COVID-19.
(2) QSAR modeling is done to find the drug targets with protein interaction, along with the calculation of binding affinities.
(3) Deep learning models are used for training the molecular descriptor dataset for the robust discovery of drugs and feature extraction for combating COVID-19.

The rest of the article is organized as follows. Section 2 reviews the literature. Section 3 describes the significance of the work. Section 4 presents the suggested methodology, Section 5 presents the results, and the paper is concluded in Section 6.

2. Literature Review

Artificial intelligence techniques have been utilized in various areas of drug and vaccine development [9]. This utilization and further advancements are essential for quickly discovering a cure for the current pandemic. Many studies have been done previously, and many are ongoing, to find a less complex and easy-to-use technique that would speed up the drug discovery process. In [10], the authors trained a model based on LSTM (long short-term memory) that reads the SMILES fingerprints of a molecule to predict its IC50 for binding to RdRp. The authors in [11] suggested a B5G framework, which supports the diagnosis of COVID-19 through low-latency 5G. Choi et al. [12] proposed the MT-DTI model for predicting FDA-approved drugs with strong affinities for the ACE2 receptor with TMPRSS2. The authors in [13] reviewed state-of-the-art research studies related to medical imaging and deep learning. Deep learning techniques and feature engineering were compared in order to efficiently diagnose COVID-19 from CT images [14]. Various neural network architectures and generative models, such as RNNs, autoencoders with adversarial learning, and reinforcement learning, have been suggested for ligand-based drug discovery [15]. The classification performance of DNNs on imbalanced compound datasets is explored by applying data balancing techniques in [16]. A novel approach for accurately deep-docking large numbers of molecular structures is suggested in [17]. The effects of deep learning on drug design, together with complementary tools, were reviewed in [18].

In [19], a systematic review of the application of deep learning techniques for predicting drug response in cancer cell lines has been done. A QSAR (quantitative structure-activity relationship) model implementing deep learning is developed in [20] to predict the antiplasmodial activity and cytotoxicity of untested compounds for malaria screening. In [21], the authors built a multitask DNN model and compared the results with a single-task DNN model. In [22], various machine learning and deep learning algorithms used for drug discovery are reviewed and their applications discussed. However, many studies that suggest deep learning for drug discovery or COVID-19 detection lack a proper practical implementation with results. Most studies have only reviewed different deep learning techniques to be used for the development of drugs. This paper gives a practical implementation on various datasets available online, with efficient results. Upon analyzing the literature, we also found that several studies claim HCS (high-content screening) to be an efficient technique for screening chemical compounds in drug discovery. At present, deep learning techniques have been producing faster and more efficient results.

The basic idea of the screening process is that cells are exposed to various compounds, and automated optical microscopy is used to see what happens, producing rich images of cells. A quantitative and qualitative analysis of the result can be done using an automated HCS pipeline. HCS branched out from microscopy, and Giuliano et al. first coined the terminology in the 1990s [23]. HCS research covers several fields, such as drug discovery, where it can be seen as a form of cell phenotypic screening. It includes methods of analysis that produce simultaneous readouts of multiple parameters for cells or cellular components. In this phase, screening is an early discovery stage in the series of steps needed to identify new drugs; it acts as a filter to select potential candidates for further development. The substances used for this purpose can be small molecules (low-molecular-weight organic compounds) or biologics such as proteins, peptides, or antibodies [24].

3. Significance of the Work

Hospitals are using trial-and-error techniques for COVID-19 drug discovery [9]. This, together with the inefficiency of the lab-based HTS (high-throughput screening) technique, has led to the emergence of virtual screening for discovering chemical compounds [9]. Also, drug discovery and development is a complex and time-consuming process [25]. It is estimated that the preapproval cost of producing a new drug has increased at a rate of 8.5% annually, from 802 million USD to 2,870 million USD [26, 27]. Finding molecules with the required characteristics is one of the significant challenges in drug discovery. A practical, quality drug needs to be balanced regarding safety and potency against its target and other properties such as ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) and physicochemical properties [25]. This paper aims to increase the speed of discovering new molecules using deep learning, thereby reducing the cost of producing new drugs. Deep learning techniques help us navigate large chemical spaces to find new chemical compounds [25]. The significance of using deep learning techniques for combating COVID-19 [1] is summarized in Table 1.

4. Suggested Methodology

This section includes a description of the proposed methodology.

4.1. Dataset Preparation and Preprocessing

We have used a combination of the datasets from the sources [29–31]. Each dataset contains a set of chemical compounds with their binding activity to a target protein, expressed as pIC50 = −log10(IC50) [32]. Preprocessing is done to remove invalid and replicated compounds. Entries with suspicious IC50 measurements, indicated by the “DATA VALIDITY COMMENT” column, are filtered out. For groups of repeated records, if the standard deviation (SD) of the activity is greater than 1 log unit, the records are removed from the dataset; otherwise, a single entry with the median of the activity is kept [32]. Data preprocessing is one of the significant phases in data mining, as it helps in achieving data integrity. Before preprocessing, data cleaning needs to be done, as raw data contain abnormalities and errors affecting the results [33]. After preprocessing, the SMILES [34] representations are converted to molecular representations. These are open datasets that contain binding, ADMET, and functional information for various bioactive compounds [35]. The database containing the datasets has over 5 million bioactivity measurements for over 1 million compounds and over 5,000 target proteins [35].
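A minimal pandas sketch of this preprocessing step is given below. The column names ("canonical_smiles", "standard_value" in nM, and "data_validity_comment") are assumptions modeled on ChEMBL-style exports and may differ from the exact schema of the datasets used here.

import numpy as np
import pandas as pd

def preprocess_bioactivity(df: pd.DataFrame) -> pd.DataFrame:
    # Drop entries flagged as suspicious and rows without a SMILES or IC50 value.
    df = df[df["data_validity_comment"].isna()]
    df = df.dropna(subset=["canonical_smiles", "standard_value"]).copy()

    # Convert IC50 (nM) to pIC50 = -log10(IC50 in molar units).
    df["pIC50"] = -np.log10(df["standard_value"] * 1e-9)

    # For replicated compounds: drop groups whose activity spread exceeds
    # 1 log unit; otherwise keep a single entry with the median activity.
    spread = df.groupby("canonical_smiles")["pIC50"].transform("std").fillna(0)
    df = df[spread <= 1.0]
    return df.groupby("canonical_smiles", as_index=False)["pIC50"].median()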

A minor challenge may occur in data mining algorithms due to variation in the range and distribution of the variables in large datasets when distance measures are used; the data may also contain noisy variables, which makes learning more difficult [33]. These challenges can be handled by min-max normalization, where the value of each variable is adjusted to a uniform range of 0 to 1 [33]. It is given in the following equation:

Ynorm = (Yx − Yminimum) / (Ymaximum − Yminimum),

where Ynorm is the normalised value, Yx is the value of interest, Yminimum is the minimum value, and Ymaximum is the maximum value.
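The normalization can be implemented in a few lines; the sketch below assumes the variables are stored column-wise in a NumPy array.

import numpy as np

def min_max_normalize(y: np.ndarray) -> np.ndarray:
    # Rescale each column to the range [0, 1].
    y_min = y.min(axis=0)
    y_max = y.max(axis=0)
    return (y - y_min) / (y_max - y_min)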

Apart from the dataset, the system used for performing the experiments runs Ubuntu 20.04 LTS with 16 GB RAM and an Intel Core i7-8700 processor, with an NVIDIA GPU and CUDA 9.0. The models are built in Python 3.7 using NumPy, pandas, TensorFlow, Bunch, tqdm, Matplotlib, scikit-learn, PyTorch 0.4.1, Mordred, and RDKit. For evaluating the binding affinities, PyRx is used. We have used regression models and QSAR techniques because regression models help define relationships between dependent and independent variables and show the strength of the impact of the independent variables on the dependent variable, while QSAR helps maintain quantitative structural relationships in molecular predictions.

4.2. Model Development and Evaluation Parameters

As mentioned above, developing a QSAR model can help us define the relationship between chemical structures and their endpoints by using various statistical methods to construct predictive models that reveal the origin of bioactivity [36]. Generally, a QSAR model is depicted by an equation that can be utilized for the prediction of endpoints of new compounds, replacing time-consuming and costly experimental approaches. In order to derive the global molecular features from the SMILES notation, a set of global descriptors is defined [36].

These global descriptors are described as follows [36]:
(1) BOND is defined as the presence or absence of double (=), triple (#), and stereochemical (@) bonds in SMILES.
(2) PAIR is defined as the coincidence of I, N, O, P, S, Br, Cl, F, #, @, and =.
(3) NOSP is defined as the presence or absence of N, O, S, and P.
(4) HALO is defined as the presence or absence of halogens.

The optimal SMILES-based descriptor is calculated from the correlation weights CW(·) of these attributes by the following equation [36]:

DCW(T, N) = CW(BOND) + CW(NOSP) + CW(HALO) + CW(PAIR) + Σk CW(Sk),

where Sk denotes the individual SMILES atoms (symbols).

The chemical endpoint [36] is then given by the following equation:

Endpoint = T0 + T1 × DCW(T, N),

where T0 is the intercept and T1 is the correlation coefficient.
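As a rough illustration of the global descriptors listed above, the sketch below computes BOND, NOSP, and HALO as simple presence/absence indicators from a SMILES string; the exact encoding and correlation weighting used in [36] may differ, so this is only an approximation.

def global_smiles_attributes(smiles: str) -> dict:
    # Presence/absence (1/0) of the global SMILES attributes described above.
    bond = int(any(ch in smiles for ch in "=#@"))                  # double/triple/stereochemical bonds
    nosp = int(any(sym in smiles for sym in ("N", "O", "S", "P")))
    halo = int(any(sym in smiles for sym in ("F", "Cl", "Br", "I")))
    return {"BOND": bond, "NOSP": nosp, "HALO": halo}

print(global_smiles_attributes("CCO"))   # {'BOND': 0, 'NOSP': 1, 'HALO': 0}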

The development of the QSAR model consists of two significant steps: (i) describing the molecular structure and (ii) performing the multivariate analysis that correlates the molecular descriptors with the observable characteristics [33]. Successful development of the model also includes data preprocessing and statistical evaluation. For evaluating the performance of the QSAR model, the statistics suggested in [33] are used: the cross-validated explained variance q², the coefficient of determination r², and the coefficients of determination for predicted vs. observed activities and observed vs. predicted activities, respectively. The cross-validated q² is calculated by the following equation:

q² = 1 − [Σj (Pj − Ppred,j)²] / [Σj (Pj − Pmean)²],

where Pj are the measured values, Ppred,j are the predicted values, and Pmean is the mean value of the entire dataset. The same equation is also used for the calculation of the external q², i.e., computed on the compounds that were not used earlier in the QSAR model development.
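The cross-validated statistic can be computed directly from aligned arrays of measured and predicted activities, as in this small sketch.

import numpy as np

def q_squared(measured: np.ndarray, predicted: np.ndarray) -> float:
    residual = np.sum((measured - predicted) ** 2)
    total = np.sum((measured - measured.mean()) ** 2)
    return 1.0 - residual / total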

For measuring the internal chemical diversity [28], let x and y be two molecules with ZX and ZY as their Morgan fingerprint sets [28]. The number of common fingerprints is |ZX ∩ ZY|, and the total number of fingerprints is |ZX ∪ ZY|. The Tanimoto similarity [28] between x and y is defined in the following equation:

T(x, y) = |ZX ∩ ZY| / |ZX ∪ ZY|.

And the Tanimoto distance [28] is given by

Td(x, y) = 1 − T(x, y).
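A short sketch of this calculation with RDKit, the toolkit used in this work, is shown below; the Morgan fingerprint radius and bit size are common defaults and are assumptions rather than values reported in the paper.

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto_distance(smiles_x: str, smiles_y: str) -> float:
    # Morgan fingerprints (radius 2, 2048 bits) for the two molecules.
    fp_x = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles_x), 2, nBits=2048)
    fp_y = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles_y), 2, nBits=2048)
    return 1.0 - DataStructs.TanimotoSimilarity(fp_x, fp_y)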

We have used RDKit [28] for the implementation of the Tanimoto distance. In earlier studies, QSAR models were developed for small compounds using a limited number of quantitative characteristics [32]. Various algorithms have since been suggested for covering significant features, including hundreds or thousands of molecular descriptors. We have used the OPLRAreg algorithm suggested in [32] to illustrate the flexibility of mathematical modeling and to show how the division of characteristics and regions helps enhance the features of QSAR datasets. OPLRAreg is given in Algorithm 1.

(1) Evaluate OPLRAreg for P = 1 // linear regression
(2) // choose the best partition feature
(3) For each candidate feature f do // choose the best partition feature in the region
(4)  Evaluate OPLRAreg with two regions and partition feature f
(5)  If error(f) < best_error then
(6)   best_error ← error(f)
(7)   yf ← f
(8)  End if
(9) End for
(10) P ← 2
(11) While the error keeps improving do // increase the number of regions
(12)  Evaluate OPLRAreg with P regions and partition feature yf
(13)  best_error ← error(P)
(14)  P ← P + 1
(15) End while
(16) Return the partition feature yf, the breakpoints Bq, and the regression coefficients
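The sketch below is only a simplified approximation of the greedy search in Algorithm 1: the actual OPLRAreg in [32] solves a mathematical-programming formulation for the breakpoints and regression coefficients, whereas here each candidate partition feature simply splits the data at its median into two regions, one linear regression is fitted per region, and the feature with the lowest weighted error is kept.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

def two_region_error(X, y, feature):
    # Split the samples into two regions at the median of the candidate feature.
    split = np.median(X[:, feature])
    error = 0.0
    for mask in (X[:, feature] <= split, X[:, feature] > split):
        if mask.sum() < 2:
            return np.inf
        model = LinearRegression().fit(X[mask], y[mask])
        error += mask.mean() * mean_absolute_error(y[mask], model.predict(X[mask]))
    return error

def choose_partition_feature(X, y):
    # Return the index of the feature that gives the lowest two-region error.
    errors = [two_region_error(X, y, f) for f in range(X.shape[1])]
    return int(np.argmin(errors))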

Due to advancements in deep learning techniques, there has been an increase in the use of neural networks in a variety of applications, including healthcare [25]. A neural network can be defined as a group of layers consisting of perceptrons, called a multilayer perceptron (MLP); a perceptron is also simply called a neuron [25]. Perceptrons are the main building blocks of a neural network and consist of three parts: weights w, a bias b, and an activation function φ [25]. Let the input vector given to a perceptron be defined as Q. Then, the output is given in the following equation:

y = φ(w · Q + b).

Both w and Q should have the same dimension. Furthermore, to enable the matrix multiplication, b and a constant 1 should be appended to the weight and input vectors, respectively [25], so that w′ = [w, b] and Q′ = [Q, 1].

And the output is given by

y = φ(w′ · Q′).

Because of its computational efficiency, matrix multiplication is used for training larger networks with forward passes and backpropagation to optimize the network parameters [25]. The different types of classification methods are given in the following sections.
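A minimal NumPy sketch of the perceptron output described above, with the bias appended to the weight vector and a constant 1 appended to the input vector, is given below; the sigmoid is used only as an example activation function.

import numpy as np

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_output(q: np.ndarray, w: np.ndarray, b: float) -> float:
    w_aug = np.append(w, b)      # [w, b]
    q_aug = np.append(q, 1.0)    # [Q, 1]
    return sigmoid(np.dot(w_aug, q_aug))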

4.2.1. Logistic Regression

Logistic regression is the most used modeling method for risk prediction [37]. A logistic regression model uses a function that restricts the model output to the range between zero and one and can therefore be used for classification. The logistic function is defined in [37] as follows:

f(r) = 1 / (1 + e^(−(w · r + s))),

where r is the input and w and s are the model parameters. The output is the modeled probability of the input belonging to a class [37]. For interpreting the meaning of the weights, the above equation can be rearranged as follows [37]:

ln( f(r) / (1 − f(r)) ) = w · r + s.

The quantity f(r) / (1 − f(r)) is called the odds, so the odds are modeled through a linear equation [37]. As in most ML (machine learning) models, the parameters are optimized with respect to a loss function [37]. Consider a given set of data points (pj, yj), where pj is the input and yj is the true output, and let ŷj denote the output of the logistic regressor. Then w and s are selected according to [37] by minimizing the following equation:

L(w, s) = −Σj [ yj ln ŷj + (1 − yj) ln(1 − ŷj) ].

This is also known as the log-loss function. The minimization problem is solved iteratively, using a coordinate descent algorithm, until the parameters converge [37].
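A hedged usage sketch of this classifier with scikit-learn is shown below; the synthetic data stands in for the molecular-descriptor matrix and activity labels (0 = inactive, 1 = active), and the hyperparameters are illustrative defaults.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=30, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))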

4.2.2. Random Forest

Random Forest is an ensemble approach that combines several decision trees to make predictions. More reliable and precise predictions can be made by combining several weak learners. In addition, ensemble techniques decrease variance and are less vulnerable to overfitting [37]. The Random Forest algorithm [38] is given in Algorithm 2.

for b = 1, …, B do
 generate a random bootstrap sample Db from the training data
 while the stopping criteria are not met do
  randomly select a subset of all features
  train tree Tb on Db using the selected features
 end while
end for
A decision tree is best described as a sequence of questions: a question is asked, and based on the response, new questions are asked, thus creating a tree. Data points are classified by the leaf nodes of the tree [37], reached by following the trajectory of questions and answers. The tree is designed by determining which question to ask at each node, based on the information gained from each possible query, i.e., the degree to which the uncertainty in the dataset [37] is reduced. The uncertainty (entropy) of the dataset X [37] is defined in the following equation:

H(X) = −Σc pc log2 pc,

where pc is the proportion of samples in X belonging to class c.

The information gained by knowing the value of a certain feature F is given in the following equation:

IG(X, F) = H(X) − Σz (|Xz| / |X|) H(Xz),

where Xz is defined as the subset of X where the feature F takes the value z. Therefore, during the construction of a decision tree, a feature is chosen to decide each node, as explained in [37]. The construction is terminated either once the entropy of the subset has reached zero or once the tree has reached its maximum depth [37]. Upon evaluation of a sample, the tree's trajectory is followed until a leaf node is reached. An approximate probability can also be given as output by comparing the class sizes found in the leaf node [37].
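The entropy and information-gain quantities defined above can be computed directly from arrays of class labels and discrete feature values, as in this small sketch.

import numpy as np

def entropy(labels: np.ndarray) -> float:
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(labels: np.ndarray, feature_values: np.ndarray) -> float:
    gain = entropy(labels)
    for z in np.unique(feature_values):
        subset = labels[feature_values == z]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain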

4.2.3. Support Vector Machine (SVM)

The support vector machine (SVM) is a classification algorithm that involves creating a hyperplane. A set of features is used in order to classify an object; thus, the hyperplane lies in p-dimensional space if there are p features [39]. The hyperplane is generated through the SVM optimization, which maximizes the distance from the nearest points, also known as support vectors [39]. Let xj be an arbitrary observation feature vector in the training set with corresponding label yj ∈ {+1, −1}, let w be the weight vector, and let T be the threshold. The constraints defined for the classification problem [39] are

w · xj − T ≥ +1 for yj = +1,
w · xj − T ≤ −1 for yj = −1.

Let dj = w · xj − T; then the output of the model can be given as follows:

ŷj = sign(dj).

Instead of fixing ||w|| = 1 for margin maximization, a lower bound on the margin can be imposed and the optimization problem defined as the minimization of ||w||² [39]. The constraint for this optimization problem can be derived from the two constraints above [39] as follows:

yj (w · xj − T) ≥ 1 for all j.

In some cases, it is necessary to implement a soft margin, allowing some points to lie on the wrong side of the hyperplane [39], in order to obtain an efficient model. A cost parameter M > 0 is introduced, which assigns penalties to errors [39]. The minimized objective function [39] is then defined as follows:

(1/2) ||w||² + M Σj ξj,

and the constraints of the optimization problem [39] are modified to

yj (w · xj − T) ≥ 1 − ξj, with ξj ≥ 0.

Most datasets are not linearly separable. However, through a nonlinear transformation into a high-dimensional space, a dataset is more likely to become linearly separable [37]. Therefore, each sample is transformed using a nonlinear function φ [37], so that

xj → φ(xj).

The problem is then considered using φ(xj) in place of xj [37]. Furthermore, using Lagrange optimization, the dual problem of maximizing [37]

LD = Σj αj − (1/2) Σj Σk αj αk yj yk φ(xj) · φ(xk)

is obtained, subject to the conditions

0 ≤ αj ≤ M and Σj αj yj = 0.
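A hedged usage sketch of the soft-margin SVM with scikit-learn is given below; the cost parameter C plays the role of M above, the RBF kernel is one possible choice of nonlinear transformation, and the synthetic data stands in for the descriptor dataset.

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=30, random_state=1)  # stand-in data
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X, y)
print("Training accuracy:", svm.score(X, y))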

The overall structure of the workflow and QSAR modeling [36, 40] is explained in Figure 1. First, we select a number of molecules; it can be any number. Each molecule has molecular descriptors that describe its physical and chemical properties and help us differentiate between molecules. Here, 1 and 0 are binary descriptors showing the presence or absence of the molecular descriptors. A collection of these descriptors constitutes the dataset. The value of X (active/inactive) is the biological activity we want to predict. This dataset is then used for training the deep learning model, which produces our results. The working of the proposed approach is represented in a flowchart, as depicted in Figure 2.
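The descriptor table feeding the model can be assembled from SMILES with Mordred (part of the software stack listed above), as in this hedged sketch; the SMILES strings here are toy examples, not molecules from the actual dataset.

from mordred import Calculator, descriptors
from rdkit import Chem

smiles_list = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]   # toy examples
mols = [Chem.MolFromSmiles(s) for s in smiles_list]
calc = Calculator(descriptors, ignore_3D=True)
descriptor_table = calc.pandas(mols)     # one row of molecular descriptors per molecule
print(descriptor_table.shape)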

5. Results

Our goal is to develop a deep learning model to suggest novel and effective drugs for combating SARS-CoV-2 and COVID-19. Our regression-based models and Random Forest model were trained on a dataset of approximately 1.5 million drug-like molecules from the data sources [29–31]. The molecules were represented in Simplified Molecular Input Line Entry System (SMILES) format, helping our model learn the features required for designing novel drug-like molecules. SMILES are character strings representing drug molecules. For example, a carbon atom is represented as C, an oxygen atom as O, a double bond as =, and a CO2 molecule can be represented as C(=O)=O. The maximum length of the string can be taken as 25 [41]. The problem of learning the SMILES grammar and reproducing it to generate novel small molecules is treated as a classification problem [42]. Each SMILES string is considered a time series, where every symbol is a time point. At a given point, the model is trained to predict the class of the next symbol in the series.
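A minimal sketch of this next-symbol formulation is shown below: each character of a SMILES string is one-hot encoded, and (prefix, next symbol) pairs form the classification examples. The character vocabulary here is illustrative only.

import numpy as np

vocab = sorted(set("C(=O)N1c#@[]"))                 # toy character vocabulary
char_to_idx = {ch: i for i, ch in enumerate(vocab)}

def one_hot_sequence(smiles: str) -> np.ndarray:
    encoded = np.zeros((len(smiles), len(vocab)))
    for t, ch in enumerate(smiles):
        encoded[t, char_to_idx[ch]] = 1.0
    return encoded

def next_symbol_pairs(smiles: str):
    # Yield (prefix one-hot matrix, index of the next symbol) training pairs.
    for t in range(1, len(smiles)):
        yield one_hot_sequence(smiles[:t]), char_to_idx[smiles[t]]

for prefix, target in next_symbol_pairs("C(=O)=O"):
    pass  # feed (prefix, target) to a sequence classifier such as an LSTM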

During preprocessing of the bioactivity data, we retrieve only the coronavirus proteinase entries whose activity is reported as IC50 values in nM (nanomolar) units [43]. Compounds with IC50 values less than 1,000 nM are considered active, whereas compounds with values greater than 10,000 nM are considered inactive; values between 1,000 and 10,000 nM are considered intermediate [43]. For the evaluation, the Lipinski descriptors [43] given in Table 2 were used.
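A small RDKit sketch of the Lipinski descriptors in Table 2 and the activity labeling thresholds described above is given below.

from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def lipinski_descriptors(smiles: str) -> dict:
    mol = Chem.MolFromSmiles(smiles)
    return {
        "MW": Descriptors.MolWt(mol),
        "LogP": Descriptors.MolLogP(mol),
        "HDonors": Lipinski.NumHDonors(mol),
        "HAcceptors": Lipinski.NumHAcceptors(mol),
    }

def bioactivity_class(ic50_nM: float) -> str:
    if ic50_nM < 1000:
        return "active"
    if ic50_nM > 10000:
        return "inactive"
    return "intermediate"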

Upon analyzing the pIC50 values, the actives and inactives show a significant difference, which is expected since IC50 < 1,000 nM (active) and IC50 > 10,000 nM (inactive) correspond to pIC50 > 6 (active) and pIC50 < 5 (inactive). Of the four Lipinski descriptors [43], only LogP showed no difference between the actives and inactives, while the other three descriptors showed significant differences. This can be better understood from Figures 3–7, respectively. A scatter plot has also been drawn to show that the two bioactivity classes (active/inactive) span similar chemical spaces.

Figures 3–7 show that our model can explore chemical spaces, which can be further adapted for generating smaller molecules specific to a target of interest. SARS-CoV-2 contains the proteins responsible for the replication of the virus [44]. The functioning of these proteins can be stopped by introducing drug molecules capable of blocking them. Therefore, we have to find molecules with a high binding affinity that can bind the protein effectively. Various drugs/compounds have been tested for such a strong binding relationship, but the results are not very good. We have created novel molecules for binding with the coronavirus using deep learning and QSAR modeling. After the generation of the molecules, PyRx was used for evaluating the binding affinities. We have also built a regression model using the Random Forest algorithm for acetylcholinesterase inhibitors, as shown in Figure 8. The binding affinities of leading drugs for other diseases, such as HIV inhibitors, range from −10 to −11. Also, remdesivir, the most recent clinically tested drug, has a binding affinity of −13. By convention, the more negative the score is, the more effective the drug is expected to be. QSAR modeling, docking analysis, and the regression model generate a list of bioactive compounds, from which the top 100 compounds that may have the potential to be effective against SARS-CoV-2 were selected. The methodology suggested in this paper is easy to use and can be a possible technique for discovering anti-COVID-19 drugs while also shortening the clinical development period required for drug repositioning. Our proposed methodology can yield binding affinities more negative than those of the drugs presently being tested, making our approach efficient. The proposed list of the top 100 chemical structures or molecules generated using our approach, written in SMILES format, is shown in Table 3.

6. Conclusion

Drug development is a time-consuming and expensive process. Deep learning has achieved excellent performance in many tasks, and drug discovery is one of the areas that can benefit from it. The use of deep learning techniques has made the process of drug development more manageable and cheaper. Deep learning-based models can learn feature representations from existing drugs that can be used to explore chemical spaces in search of more drug-like molecules. Efficient drug discovery from the available data, with automated processes and better predictions, is what deep learning techniques promise. These techniques have already proven effective in scanning peptides and in detecting COVID-19 from CT scans or X-ray images. They can speed up the drug development process but require clinical testing for further validation and accuracy [45].

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.