Abstract
Early diagnosis of pandemic diseases such as COVID-19 can prove beneficial in dealing with difficult situations and can help radiologists and other experts manage staffing more effectively. The application of deep learning techniques to genetics, microscopy, and drug discovery has created a global impact. It can enhance and speed up medical research and the development of vaccines, which is required for pandemics such as COVID-19. However, current drugs such as remdesivir and clinical trials of other chemical compounds have not shown very impressive results. Therefore, providing an effective treatment or drug may take more time. In this paper, a deep learning approach based on logistic regression, SVM, Random Forest, and QSAR modeling is suggested. QSAR modeling is done to find the drug targets with protein interaction along with the calculation of binding affinities. Deep learning models were then used for training the molecular descriptor dataset for the robust discovery of drugs and feature extraction for combating COVID-19. The results show strong binding affinities (scores more negative than −18) for many molecules that can be used to block the multiplication of SARS-CoV-2, the virus responsible for COVID-19.
1. Introduction
The first case of COVID-19 was detected in December 2019, and since then the disease has grown rapidly, affecting millions of people around the globe. More than 2 million cases have been confirmed, with over 0.15 million deaths globally [1, 2]. Drug repurposing is defined as discovering and identifying new applications for existing drugs in the treatment of various diseases [3]. Recent advancements in drug discovery using deep learning have made it possible to speed up the identification and development of new pharmaceuticals [4]. Various drugs, such as Arbidol, remdesivir, and favipiravir, have been tested to cure COVID-19 patients, and many others are in the testing phase [4]. Biomedical researchers are investigating drugs for treating patients while attempting to develop a vaccine for preventing the virus [5]. Meanwhile, computer scientists have developed early detection models for COVID-19 from CT scans and X-ray images [5]; these techniques are based on deep learning and have been applied successfully in various fields [5]. Over the past few years, a significant increase in the quantity of biomedical data has resulted in the emergence of new technologies, such as parallel synthesis and HTS (high-throughput screening), for mining large-scale chemical data [6]. Since COVID-19 is transmitted from person to person, electronic devices based on artificial intelligence may play a crucial role in preventing the spread of the virus. With the expansion of the role of health epidemiologists, the pervasiveness of electronic health data has also increased [7]. The increasing availability of electronic health data provides a massive opportunity to enhance healthcare through both discovery and practical application [7]. These data can be used to train machine learning algorithms to improve their decision-making in terms of disease prediction [7].
As the rapid increase in the number of coronavirus infections outnumbered the medical services available in hospitals, a significant burden was imposed on healthcare systems [7]. Because of the limited supply of hospital services and the delay in obtaining diagnostic test results, it is difficult for health professionals to provide patients with sufficient medical care. Moreover, since the number of cases to be tested for coronavirus grows day by day, exhaustive testing is not feasible due to time and cost factors [7]. This paper aims at suggesting a technique based on deep learning that would be helpful in rapidly finding drugs to combat the pandemic. Deep learning is an area that is quickly emerging and constantly expanding. It programs computers using data in order to optimize their performance: using the training data or previous encounters, it learns the parameters that optimize the program, and it can also make forecasts from the data. Deep learning also lets us use the statistics of the data to construct a mathematical model. Its main goal is to learn from the data (experience) provided without human intervention, searching for trends and patterns in the data and giving the desired output [8]. Deep learning techniques have achieved high efficiency in various tasks, including drug development, prediction of properties, and drug target forecasting. As drug development is a complex task, the deep learning approach makes this process faster and cheaper.
The challenges posed by COVID-19 at present make it necessary to look for alternative medicines or drugs to combat the rise of cases due to COVID-19 infection. One of the significant challenges is the processing delay in finalizing drugs for vaccine formulation. However, many pharmaceutical companies have achieved some success after passing through different trials. Hence, predicting the most probable drugs for vaccine formulation can speed up the process and thus save many human lives. Another challenge is that most of the testing for vaccine formulation is done on a clinical basis, where all drug combinations are tried to obtain the desired selection of drugs; at present, there is little utilization of computational techniques for this purpose. Thus, there is a pressing need to look for alternatives based on machine intelligence techniques that can provide more accurate solutions faster.
Based on the above challenges, the main contributions of the paper are as follows:
(1) A deep learning approach based on logistic regression, SVM, and Random Forest, along with QSAR modeling, is proposed to discover drugs for the treatment of COVID-19.
(2) QSAR modeling is done to find the drug targets with protein interaction along with the calculation of binding affinities.
(3) Deep learning models are used for training the molecular descriptor dataset for the robust discovery of drugs and feature extraction for combating COVID-19.
The rest of the article is organized as follows. Section 2 reviews the literature. Section 3 presents the significance of the work. Section 4 describes the suggested methodology, followed by Section 5, which presents the results; the paper is concluded in Section 6.
2. Literature Review
Artificial intelligence techniques have been utilized in various areas of drug and vaccine development [9]. This utilization and further advancements are essential for immediately discovering a cure for the current pandemic. Many studies have been done previously, and many are ongoing, to find a less complex and easy-to-use technique that would speed up the drug discovery process. In [10], the authors trained a model based on LSTM (long short-term memory) networks that reads the SMILES fingerprints of a molecule to predict its IC50 for binding to RdRp. The authors in [11] suggested a B5G framework, which supports the diagnosis of COVID-19 through the low latency of 5G. Choi et al. [12] proposed the MT-DTI model for predicting FDA-approved drugs with strong affinities for the ACE2 receptor and TMPRSS2. The authors in [13] reviewed state-of-the-art research studies related to medical imaging and deep learning. Deep learning techniques and feature engineering were compared in order to efficiently diagnose COVID-19 from CT images [14]. Various neural network architectures and generative models, such as RNNs, autoencoders with adversarial learning, and reinforcement learning, have been suggested for ligand-based drug discovery [15]. The classification performance of DNNs on imbalanced compound datasets is explored by applying data balancing techniques in [16]. A novel approach for accurately deep-docking large numbers of molecular structures is suggested in [17]. The effects of deep learning in drug design and complementary tools were reviewed in [18].
In [19], a systematic review of the application of deep learning techniques for predicting drug response in cancer cell lines has been done. A QSAR (quantitative structure–activity relationship) model is developed in [20], which uses deep learning to predict the antiplasmodial activity and cytotoxicity of untested compounds for malaria screening. In [21], the authors built a multitask DNN model and compared the results with a single-task DNN model. In [22], various machine learning and deep learning algorithms used for drug discovery are reviewed, and their applications are discussed. However, many studies suggesting deep learning for drug discovery or for detecting COVID-19 lack a proper practical implementation with results; most have merely reviewed the deep learning techniques that could be used for drug development. This paper gives a practical implementation on various datasets available online, with efficient results. Upon analyzing these studies, we found that several claim HCS (high-content screening) to be an efficient technique for screening chemical compounds in drug discovery. At present, deep learning techniques have been producing faster and more efficient results.
The basic idea of the screening process is that cells are exposed to various compounds, and automated optical microscopy is done to see what happens, producing detailed images of the cells. A quantitative and qualitative analysis of the result can be done using an automated HCS pipeline. HCS branches out from microscopy, and Giuliano et al. first coined the term in the 1990s [23]. HCS research covers several fields, such as the discovery of drugs, where it can be defined as a form of cell phenotypic screen. It includes methods of analysis that produce simultaneous readouts of multiple parameters for cells or cell compounds. In this phase, the screening aspect is an early discovery stage in a series of steps needed to identify new drugs; it acts as a filter to target potential candidates for further development. The substances screened can be small molecules, classified as low-molecular-weight organic compounds, or biomolecules such as proteins, peptides, or antibodies [24].
3. Significance of the Work
Hospitals are using trial-and-error techniques for COVID-19 drug discovery [9]. This has resulted in the emergence of virtual screening for discovering chemical compounds, owing to the inefficiency of the lab-based HTS (high-throughput screening) technique [9]. Also, drug discovery and development is a complex and time-consuming process [25]. It is estimated that the preapproval cost of producing a new drug has increased at a rate of 8.5% annually, from 802 million USD to 2,870 million USD [26, 27]. Finding molecules with the required characteristics is one of the significant challenges in drug discovery. A practical, quality drug needs to be balanced in terms of safety and potency against its target as well as other properties such as ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) and physicochemical properties [25]. This paper aims to increase the speed of discovering new molecules using deep learning, thereby reducing the cost of producing new drugs. Deep learning techniques will help us navigate large chemical spaces to find new chemical compounds [25]. The significance of using deep learning techniques for combating COVID-19 [1] is summarized in Table 1.
4. Suggested Methodology
This section includes a description of the proposed methodology.
4.1. Dataset Preparation and Preprocessing
We have used a combination of the datasets from the sources [29–31]. Each dataset contains a set of chemical compounds with their binding activity to a target protein, expressed as pIC50 = −log_{10}(IC_{50}) [32]. Preprocessing is done to remove invalid and replicated compounds. Entries whose IC50 measurements are flagged as suspicious in the "DATA VALIDITY COMMENT" column are filtered out. For groups of repeated records, if the standard deviation (SD) of the activity is greater than 1 log unit, the group is removed from the dataset; otherwise, a single entry with the median of the activity is kept [32]. Data preprocessing is one of the significant phases in data mining, as it helps in achieving data integrity. Before preprocessing, data cleaning needs to be done, as raw data contain abnormalities and errors affecting the results [33]. After preprocessing, the SMILES [34] representations are converted to molecular representations. These are open datasets that contain binding, ADMET, and functional information for various bioactive compounds [35]. The database containing the datasets has over 5 million bioactivity measurements for over 1 million compounds and over 5,000 target proteins [35].
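The duplicate-handling rule described above can be sketched in a few lines of Python; the function names and the record layout (compound id, activity) are illustrative, not the paper's actual pipeline:

```python
from statistics import median, stdev

def curate(records):
    """Collapse replicated activity measurements per compound.

    If the standard deviation of a compound's activities exceeds
    1 log unit, the compound is dropped as inconsistent; otherwise
    a single entry holding the median activity is kept.
    """
    groups = {}
    for compound_id, activity in records:
        groups.setdefault(compound_id, []).append(activity)
    curated = {}
    for compound_id, activities in groups.items():
        if len(activities) > 1 and stdev(activities) > 1.0:
            continue  # inconsistent replicates: discard the group
        curated[compound_id] = median(activities)
    return curated
```

For example, replicates [6.0, 6.2] collapse to their median 6.1, while replicates [4.0, 7.0] (SD ≈ 2.1 log units) are discarded.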
A minor challenge may occur in data mining algorithms due to variation in the range and distribution of each variable in large datasets, since distance measurements are affected; the data may also contain noisy variables, which makes learning more difficult [33]. These challenges can be handled by min-max normalization, where the value of each variable is adjusted to a uniform range of 0 to 1 [33]. It is given in the following equation: Y_{normalized} = (Y_{x} − Y_{minimum})/(Y_{maximum} − Y_{minimum}), where Y_{normalized} is the normalized value, Y_{x} is the value of interest, Y_{minimum} is the minimum value, and Y_{maximum} is the maximum value.
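The min-max normalization above amounts to a one-liner per value; a minimal sketch (the constant-column guard is an added assumption, since the formula divides by zero when all values are equal):

```python
def min_max_normalize(values):
    """Scale each value into [0, 1]: (y - min) / (max - min)."""
    y_min, y_max = min(values), max(values)
    if y_max == y_min:
        return [0.0 for _ in values]  # constant column: no spread to scale
    return [(y - y_min) / (y_max - y_min) for y in values]
```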
Apart from the dataset, the system used for performing the experiments runs Ubuntu 20.04 LTS with 16 GB RAM and an Intel Core i7-8700 processor. The models are built in Python 3.7 with NumPy, pandas, TensorFlow, Bunch, tqdm, Matplotlib, scikit-learn, an NVIDIA GPU with CUDA 9.0, PyTorch 0.4.1, Mordred, and RDKit. For evaluating the binding affinities, PyRx is used. We have used regression models and QSAR techniques, as regression models help define relationships between dependent and independent variables and show the strength of the impact of the independent variables on the dependent variable, while QSAR helps maintain the quantitative structural relationships in molecular predictions.
4.2. Model Development and Evaluation Parameters
As mentioned above, developing a QSAR model can help us define the relationship between chemical structures and their endpoints by using various statistical methods to construct predictive models that reveal the origin of bioactivity [36]. Generally, a QSAR model is depicted by an equation that can be utilized for the prediction of the endpoints of new compounds, avoiding time-consuming and costly experimental approaches. In order to derive the global molecular features from the SMILES, some notation is needed [36], in terms of the global descriptors described below.
These global descriptors are defined as follows [36]:
(1) BOND denotes the presence or absence of double (=), triple (#), and stereochemical (@) bonds in the SMILES.
(2) PAIR denotes the coincidence of I, N, O, P, S, Br, Cl, F, #, @, and =.
(3) NOSP denotes the presence or absence of N, O, S, and P.
(4) HALO denotes the presence or absence of halogens.
The optimal attributes for the SMILES are combined in the following equation [36]: DCW(T, N) = CW(BOND) + CW(NOSP) + CW(HALO) + CW(PAIR) + Σ_{k}CW(S_{k}), where S_{k} are the SMILES attributes, CW(x) is the correlation weight of attribute x, T is the threshold, and N is the number of optimization epochs.
The chemical endpoint [36] can then be given by the following equation: Endpoint = T_{0} + T_{1} × DCW(T, N), where T_{0} is the intercept and T_{1} is the correlation coefficient.
The development of the QSAR model consists of two significant steps: (i) describing the molecular structure and (ii) multivariate analysis for correlating the molecular descriptors with the observable characteristics [33]. Successful development of the model also includes data preprocessing and statistical evaluation. For evaluating the performance of the QSAR model, the statistical method suggested in [33] is used, where x^{2} is the cross-validated explained variance and Y^{2} is the coefficient of determination, computed from the predicted vs. observed activities and vice versa, respectively. x^{2} is calculated by the following equation: x^{2} = 1 − Σ_{j}(P_{j} − P̂_{j})^{2}/Σ_{j}(P_{j} − P̄)^{2}, where P_{j} are the measured values, P̂_{j} are the predicted values, and P̄ is the mean value of the entire dataset. The same equation is also used for the calculation of the external x^{2}, i.e., computed on compounds that were not used earlier in the QSAR model development.
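The cross-validated explained variance can be computed directly from the measured and predicted activities; a sketch with illustrative names:

```python
def q_squared(observed, predicted):
    """Cross-validated explained variance:
    1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)."""
    mean_obs = sum(observed) / len(observed)
    press = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - press / ss_total
```

A perfect predictor scores 1.0; a predictor no better than the mean of the observations scores 0.0.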
For measuring the internal chemical diversity [28], let x and y be two molecules having Z_{X} and Z_{Y} as their Morgan fingerprints [28]. The number of common fingerprints is defined as |Z_{X} ∩ Z_{Y}| and the total number of fingerprints as |Z_{X} ∪ Z_{Y}|. The Tanimoto similarity [28] between x and y is then defined in the following equation: T(x, y) = |Z_{X} ∩ Z_{Y}|/|Z_{X} ∪ Z_{Y}|.
And the Tanimoto distance [28] is given by T_{d}(x, y) = 1 − T(x, y).
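The Tanimoto similarity and distance defined above reduce to set operations on the fingerprint bits. In practice this is computed with RDKit on Morgan fingerprints, but a plain-Python sketch illustrates the definition:

```python
def tanimoto_similarity(fp_x, fp_y):
    """T(x, y) = |Zx ∩ Zy| / |Zx ∪ Zy| over sets of fingerprint bits."""
    fp_x, fp_y = set(fp_x), set(fp_y)
    union = fp_x | fp_y
    if not union:
        return 1.0  # two empty fingerprints: treat as identical
    return len(fp_x & fp_y) / len(union)

def tanimoto_distance(fp_x, fp_y):
    """Distance is one minus the similarity."""
    return 1.0 - tanimoto_similarity(fp_x, fp_y)
```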
We have used RDKit [28] for the implementation of the Tanimoto distance. In earlier studies, QSAR models were developed for small compounds using limited quantitative characteristics [32]. Various algorithms have been suggested for covering significant features, including hundreds or thousands of molecular descriptors. We have used the OPLRAreg algorithm suggested in [32] to illustrate the flexibility of mathematical modeling and to show how the division of characteristics into regions helps enhance the features of QSAR datasets. OPLRAreg is given in Algorithm 1.

Due to advancements in deep learning techniques, there has been an increase in the use of neural networks in a variety of applications, including healthcare [25]. A neural network can be defined as a group of layers consisting of perceptrons, called a multilayer perceptron (MLP); a single perceptron is also simply called a neuron [25]. Perceptrons are the main building blocks of a neural network and consist of three parts: weights, w, a bias, b, and an activation function, f [25]. Let the input vector given to a perceptron be defined as x ∈ R^{Q}. Then, the output is given in the following equation: a = f(w^{T}x + b).
Both w and x should have the same dimension. Furthermore, to enable the computation as a single matrix multiplication, b is appended to the weight vector and a constant 1 is appended to the input vector [25], so that w̃ = [w; b] and x̃ = [x; 1].
And the output is given by a = f(w̃^{T}x̃).
Expressing the computation as matrix multiplication increases computational efficiency, which is required for training larger networks with forward passes and backpropagation to optimize the network parameters [25]. The different classification methods are described in the following sections.
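The perceptron output and its layered composition described above can be sketched as follows; the helper names and the nested-list layer layout are illustrative, and a real implementation would use matrix operations instead of explicit loops:

```python
import math

def perceptron(weights, bias, inputs, activation=math.tanh):
    """Single perceptron: a = f(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def mlp_forward(layers, inputs, activation=math.tanh):
    """Forward pass through a list of layers; each layer is a list of
    (weights, bias) pairs, one pair per neuron."""
    a = list(inputs)
    for layer in layers:
        a = [perceptron(w, b, a, activation) for w, b in layer]
    return a
```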
4.2.1. Logistic Regression
Logistic regression is the most widely used modeling method for risk prediction [37]. A logistic regression model uses the logistic function to restrict the model output to a range between zero and one and can therefore be used for classification. The logistic function is defined in [37] as follows: P(r) = 1/(1 + e^{−(w^{T}r + s)}), where r is the input and w and s are the model parameters. The output is the modeled probability of the input belonging to a class [37]. For interpreting the meaning of the weights, the above equation can be rearranged as follows [37]: P(r)/(1 − P(r)) = e^{w^{T}r + s}.
P(r)/(1 − P(r)) is called the odds; thus, the logarithm of the odds is modeled by a linear equation [37]. Like most ML (machine learning) models, the parameters are optimized w.r.t. a loss function [37]. Consider a given set of data points (p_{j}, y_{j}), where p_{j} is the input and y_{j} is the true output, and let ŷ_{j} denote the output of the logistic regressor. Then, the parameters are selected according to [37] the following equation: (w, s) = argmin −Σ_{j}[y_{j}log(ŷ_{j}) + (1 − y_{j})log(1 − ŷ_{j})].
This is also known as the log-loss function. The minimization problem is solved iteratively until the parameters converge, using a coordinate descent algorithm [37].
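A minimal plain-Python sketch of the logistic model and the log-loss above (the helper names and the probability clipping are illustrative additions for numerical safety):

```python
import math

def sigmoid(z):
    """Logistic function mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, intercept, x):
    """P(y = 1 | x) = sigmoid(w . x + s)."""
    return sigmoid(sum(w_i * x_i for w_i, x_i in zip(weights, x)) + intercept)

def log_loss(y_true, y_prob, eps=1e-12):
    """Mean negative log-likelihood over the data points."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)
```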
4.2.2. Random Forest
Random Forest is an ensemble approach that combines several decision trees to make predictions. More reliable and precise predictions can be made by combining several weak learners. In addition, ensemble techniques decrease variance and are less vulnerable to overfitting [37]. The Random Forest algorithm [38] is given in Algorithm 2.

A decision tree is best described as a sequence of questions. The principle is that questions are asked, and new questions are asked based on the responses, thus creating a tree. Data points are classified at the leaf nodes of the tree [37] by following the trajectory of the questions and answers. The tree is designed by determining which question to ask at each node, chosen based on the information gained from each possible query, i.e., the degree to which the uncertainty in the dataset [37] is reduced. The uncertainty (entropy) of the dataset [37] is defined in the following equation: H(X) = −Σ_{c}p_{c}log_{2}(p_{c}), where p_{c} is the fraction of samples in X belonging to class c.
The information gained by knowing the value of a certain feature F is given in the following equation: IG(X, F) = H(X) − Σ_{z}(|X_{z}|/|X|)H(X_{z}), where X_{z} is defined as the subset where the feature F takes the value z. Therefore, during the construction of a decision tree, a feature is decided for each node as explained in [37]. The construction is terminated either once the entropy of the subset has reached zero or once the tree has reached its maximum depth [37]. Upon evaluation of a sample, the tree's trajectory is followed until a leaf node is reached. An approximate probability can also be given as output by comparing the class sizes found in the leaf node [37].
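The entropy and information-gain formulas above can be computed directly; a sketch, with illustrative function names:

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) = -sum_c p_c * log2(p_c) over the class fractions p_c."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """IG(X, F) = H(X) - sum_z (|X_z|/|X|) * H(X_z), splitting on F."""
    n = len(labels)
    subsets = {}
    for label, value in zip(labels, feature_values):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder
```

A feature that splits the data into pure subsets yields the full entropy as gain; a feature independent of the labels yields zero.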
4.2.3. Support Vector Machine (SVM)
The support vector machine (SVM) is a classification algorithm that involves creating a hyperplane. A set of features is used to classify an object; thus, the hyperplane lies in p-dimensional space if there are p features [39]. The hyperplane is generated through SVM optimization, which maximizes the distance to the nearest points, also known as support vectors [39]. Let p_{j} be an arbitrary observation feature vector in the training set with corresponding label y_{j} ∈ {−1, +1}, let w be the weight vector, and let T be the threshold. The constraints defined for the classification problem [39] are w^{T}p_{j} − T ≥ +1 for y_{j} = +1 and w^{T}p_{j} − T ≤ −1 for y_{j} = −1, which can be combined as y_{j}(w^{T}p_{j} − T) ≥ 1.
Let s(p) = w^{T}p − T; then the output of the model is given by ŷ = sign(s(p)).
For margin maximization, the lower bound on the margin is maximized, which can equivalently be defined as the minimization of (1/2)‖w‖^{2} [39]. The constraints for this optimization problem follow from the class constraints above [39]: y_{j}(w^{T}p_{j} − T) ≥ 1 for all j.
In some cases, it is required to implement a soft margin, allowing some points to lie on the wrong side of the hyperplane [39], in order to provide an efficient model. A cost parameter M > 0 is introduced, which plays a major role in the assignment of penalties to errors [39]. The minimized objective function [39] is then defined as (1/2)‖w‖^{2} + MΣ_{j}ξ_{j}, where ξ_{j} ≥ 0 are slack variables, and the constraints of the optimization problem [39] are modified to the following equation: y_{j}(w^{T}p_{j} − T) ≥ 1 − ξ_{j}.
Most datasets are not linearly separable, but through a nonlinear transformation into a high-dimensional space, a dataset is more likely to become linearly separable [37]. Therefore, each sample is transformed using a nonlinear function φ [37], so that p_{j} → φ(p_{j}).
The problem is then considered using φ(p_{j}) in place of p_{j} [37]. Furthermore, using Lagrange optimization, the dual problem of maximizing L(α) = Σ_{j}α_{j} − (1/2)Σ_{j}Σ_{k}α_{j}α_{k}y_{j}y_{k}φ(p_{j})^{T}φ(p_{k}) [37] is defined, subject to the conditions 0 ≤ α_{j} ≤ M and Σ_{j}α_{j}y_{j} = 0.
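The soft-margin objective above can be evaluated for a candidate hyperplane by treating each slack variable as the hinge loss of its point, ξ_{j} = max(0, 1 − y_{j}s(p_{j})). A sketch with illustrative names:

```python
def decision_value(weights, threshold, x):
    """s(x) = w . x - T; the predicted class is sign(s(x))."""
    return sum(w_i * x_i for w_i, x_i in zip(weights, x)) - threshold

def soft_margin_objective(weights, threshold, cost, samples):
    """0.5 * ||w||^2 + M * sum of slacks, with each slack taken as the
    hinge loss max(0, 1 - y * s(x)) of its sample (x, y)."""
    regularizer = 0.5 * sum(w * w for w in weights)
    slack = sum(max(0.0, 1.0 - y * decision_value(weights, threshold, x))
                for x, y in samples)
    return regularizer + cost * slack
```

Points classified with margin at least 1 contribute no slack, so only the regularizer remains.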
The overall structure of the workflow and QSAR modeling [36, 40] is explained in Figure 1. First, we have to select a number of molecules; it can be any number. Each molecule has molecular descriptors that describe its physical and chemical properties and help us differentiate between molecules. Here, 1 and 0 are binary descriptors that show the presence/absence of the molecular descriptors. A collection of these descriptors constitutes the dataset. The values of X (active/inactive) represent the biological activity we want to predict. This dataset is then used for training the deep learning model, which produces our results. The working of the proposed approach is represented in a flowchart, as depicted in Figure 2.
5. Results
Our goal is to develop a deep learning model to suggest novel and effective drugs for combating SARS-CoV-2, the virus responsible for COVID-19. Our regression-based models and Random Forest model were trained on a dataset of approximately 1.5 million drug-like molecules from the data sources [29–31]. The molecules were represented in the Simplified Molecular Input Line Entry System (SMILES) format, helping our model learn the features required for designing novel drug-like molecules. SMILES are character strings that represent drug molecules. For example, a carbon atom can be represented as C, an oxygen atom as O, a double bond as =, and a CO_{2} molecule can be represented as C(=O)=O. The maximum length of a string is taken as 25 [41]. The problem of learning the SMILES grammar and reproducing it to generate novel small molecules is considered a classification problem [42]. Each SMILES string is treated as a time series, where every symbol is considered a time point. At a given point, the model is trained to predict the class of the next symbol in the series.
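Treating a SMILES string as a time series of symbols, the training pairs for next-symbol prediction can be built as follows. This is a simplification: character-level tokenization is assumed, whereas multi-character atoms such as Cl or Br would really require a proper SMILES tokenizer:

```python
def smiles_to_training_pairs(smiles, max_len=25):
    """Build (prefix, next-symbol) pairs from a SMILES string,
    truncated to max_len symbols, for next-symbol classification."""
    s = smiles[:max_len]
    return [(s[:i], s[i]) for i in range(1, len(s))]
```

For CO2, written C(=O)=O, the first pair asks the model to predict "(" from the prefix "C", and so on through the string.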
During preprocessing of the bioactivity data, we retrieve only the entries for the coronavirus proteinase whose activity is reported as IC50 values in nM (nanomolar) units [43]. Compounds with IC50 values of less than 1,000 nM are considered active, whereas compounds with values greater than 10,000 nM are considered inactive; values between 1,000 and 10,000 nM are considered intermediate [43]. To evaluate the model, the Lipinski descriptors [43] given in Table 2 were used.
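The activity thresholds and the IC50-to-pIC50 conversion described above amount to the following sketch (function names are illustrative):

```python
import math

def bioactivity_class(ic50_nm):
    """Label a compound by its IC50 in nM: < 1,000 active,
    > 10,000 inactive, otherwise intermediate."""
    if ic50_nm < 1_000:
        return "active"
    if ic50_nm > 10_000:
        return "inactive"
    return "intermediate"

def pic50_from_ic50(ic50_nm):
    """pIC50 = -log10(IC50 in molar); 1,000 nM maps to pIC50 = 6."""
    return -math.log10(ic50_nm * 1e-9)
```

The 1,000 nM and 10,000 nM cutoffs thus correspond to pIC50 values of 6 and 5, matching the class boundaries discussed below.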
Upon analyzing the pIC50 values, the actives and inactives show a significant difference, which is expected, since IC50 < 1,000 nM = active and IC50 > 10,000 nM = inactive correspond to pIC50 > 6 = active and pIC50 < 5 = inactive. Of the four Lipinski descriptors [43], only logP showed no difference between the actives and inactives, while the other three descriptors showed significant differences. This can be better understood from Figures 3–7, respectively. A scatter plot has also been drawn to show that the two bioactivity classes (active/inactive) span similar chemical spaces.
Figures 3–7 show that our model can explore chemical spaces that can be further adapted for generating smaller molecules specific to a target of interest. SARS-CoV-2 contains proteins responsible for the replication of the virus [44]. The functioning of these proteins can be stopped by introducing drug molecules capable of blocking them. Therefore, we have to find molecules with a high binding affinity that bind the protein effectively. Various drugs/compounds have been tested for a strong binding relationship, but the results are not very good. We have created novel molecules for binding with the coronavirus, using deep learning and QSAR modeling. After the generation of the molecules, PyRx was used for evaluating the binding affinities. We have also built a regression model using a Random Forest algorithm for acetylcholinesterase inhibitors, as shown in Figure 8. The binding affinities of leading drugs for other diseases, such as HIV inhibitors, range from −10 to −11. Also, remdesivir, the most recent clinically tested drug, has a binding affinity of −13. By convention, the more negative the score, the more effective the drug. QSAR modeling, docking analysis, and the regression model generated a list of bioactive compounds, from which the top 100 compounds that may have the potential to be effective against SARS-CoV-2 were selected. The methodology suggested in this paper is easy to use and can be a possible technique for the discovery of anti-COVID-19 drugs, also shortening the clinical development period required for drug repositioning. Our proposed approach can yield more negative (i.e., stronger) binding affinities than the drugs presently being tested, making it efficient. The proposed list of the top 100 chemical structures or molecules generated by our approach, represented as SMILES strings, is shown in Table 3.
6. Conclusion
Drug development is a time-consuming and expensive process. Deep learning has achieved excellent performance in many tasks, and drug discovery is one of the areas that can benefit from it. The use of deep learning techniques can make drug development more manageable and cheaper. Deep-learning-based models can learn feature representations from existing drugs and use them to explore chemical spaces in search of more drug-like molecules. By exploiting the available data, deep learning techniques promise automated processes and better predictions for efficient drug discovery. These techniques have proven effective in scanning peptides and in detecting COVID-19 from CT scans or X-ray images. They can speed up the drug development process but require clinical testing for further validation and accuracy [45].
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare no conflicts of interest.