Journal of Chemistry

Volume 2019, Article ID 9858371, 15 pages

https://doi.org/10.1155/2019/9858371

## Application of Multivariate Adaptive Regression Splines (MARSplines) for Predicting Hansen Solubility Parameters Based on 1D and 2D Molecular Descriptors Computed from SMILES String

Chair and Department of Physical Chemistry, Faculty of Pharmacy, Collegium Medicum of Bydgoszcz, Nicolaus Copernicus University in Toruń, Kurpińskiego 5, 85-950 Bydgoszcz, Poland

Correspondence should be addressed to Tomasz Jeliński; tomasz.jelinski@cm.umk.pl

Received 29 October 2018; Revised 12 December 2018; Accepted 17 December 2018; Published 10 January 2019

Academic Editor: Teodorico C. Ramalho

Copyright © 2019 Maciej Przybyłek et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

A new method of Hansen solubility parameters (HSPs) prediction was developed by combining the multivariate adaptive regression splines (MARSplines) methodology with a simple multivariable regression involving 1D and 2D PaDEL molecular descriptors. In order to adopt the MARSplines approach to QSPR/QSAR problems, several optimization procedures were proposed and tested. The effectiveness of the obtained models was checked via standard QSPR/QSAR internal validation procedures provided by the QSARINS software and by predicting the solubility classification of polymers and drug-like solid solutes in collections of solvents. By utilizing information derived only from SMILES strings, the obtained models allow for computing all of the three Hansen solubility parameters including dispersion, polarization, and hydrogen bonding. Although several descriptors are required for proper parameters estimation, the proposed procedure is simple and straightforward and does not require a molecular geometry optimization. The obtained HSP values are highly correlated with experimental data, and their application for solving solubility problems leads to essentially the same quality as for the original parameters. Based on provided models, it is possible to characterize any solvent and liquid solute for which HSP data are unavailable.

#### 1. Introduction

Modeling the physicochemical properties of multicomponent systems, such as solubility and miscibility, requires information about the nature of the interactions between the components. A comprehensive and general characterization of intermolecular interactions was introduced in 1936 by Hildebrand [1]. This approach is based on the analysis of the solubility parameter *δ*, defined as the square root of the cohesive energy density, which can be estimated directly from the enthalpy of vaporization, Δ*H*_{v}, and the molar volume, *V*_{m} (equation (1)):

$$\delta = \sqrt{\frac{\Delta H_{v} - RT}{V_{m}}} \tag{1}$$

where *R* is the gas constant and *T* is the absolute temperature.

Since the cohesive energy is the energy required to release a unit volume of molecules from their surroundings, the solubility parameter can be used as a measure of the affinity between compounds in solution. In his doctoral thesis [2], Hansen presented the concept of decomposing the solubility parameter into dispersion (*d*), polarity (*p*), and hydrogen bonding (HB) parts, which enables a much better description of intermolecular interactions and broad usability [3, 4]. By calculating the Euclidean distance between two points in the Hansen space, one can evaluate the miscibility of two substances according to the commonly known rule “*similia similibus solvuntur*.” Hansen solubility parameters are applied in many scientific and industrial fields, including polymer materials, paints, and coatings (e.g., miscibility and solubility [5–9], environmental stress cracking [10, 11], adhesion [12], plasticizer compatibility [13], swelling, solvent diffusion, and permeation [14, 15], polymer sensor design [16], and pigment and nanomaterial dispersibility [3, 17–20]), membrane filtration techniques [21], and pharmaceutics and pharmaceutical technology (e.g., solubility [22–27], cocrystal screening [28, 29], drug-DNA interaction [30], prediction of a drug’s absorption site [31], skin permeation [32], drug-nail affinity [33], drug-polymer miscibility, and hot-melt extrusion technology [34–37]).
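The two quantities discussed above can be sketched numerically. The following minimal Python example is an illustration only and is not taken from the original work: the function names, the units (J/mol and cm³/mol, giving MPa^0.5), and the default temperature are assumptions. The total parameter is computed from the cohesive energy density using the common Δ*H*_{v} − *RT* estimate of the cohesive energy. The distance function returns the plain Euclidean distance mentioned in the text by default; the widely used Hansen convention of weighting the squared dispersion difference by 4 is exposed through a hypothetical `dispersion_weight` argument.

```python
import math

R_GAS = 8.314462618  # gas constant, J/(mol*K)


def hildebrand_delta(dh_vap_j_mol, molar_volume_cm3_mol, temperature_k=298.15):
    """Total solubility parameter (MPa^0.5) from the cohesive energy density.

    Assumes the cohesive energy is approximated by dH_vap - R*T.
    """
    cohesive_energy = dh_vap_j_mol - R_GAS * temperature_k  # J/mol
    # J/cm^3 is numerically equal to MPa, so the root is in MPa^0.5.
    return math.sqrt(cohesive_energy / molar_volume_cm3_mol)


def hansen_distance(hsp_a, hsp_b, dispersion_weight=1.0):
    """Distance between two (delta_d, delta_p, delta_hb) points.

    dispersion_weight=1.0 gives the plain Euclidean distance;
    dispersion_weight=4.0 gives the weighted form commonly used with HSP.
    """
    dd, dp, dh = (a - b for a, b in zip(hsp_a, hsp_b))
    return math.sqrt(dispersion_weight * dd ** 2 + dp ** 2 + dh ** 2)


# Illustrative (not experimental) HSP triples in MPa^0.5.
solute = (18.0, 8.0, 6.0)
solvent = (16.0, 9.0, 8.0)
euclidean = hansen_distance(solute, solvent)
weighted = hansen_distance(solute, solvent, dispersion_weight=4.0)
```

A smaller distance indicates a closer match between the interaction profiles of the two substances, which is the quantitative form of the “like dissolves like” rule.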

Due to the high usability of HSP, many experimental and theoretical methods for determining these parameters have been proposed. For example, HSP can be calculated using an equation of state [38] derived from statistical thermodynamics. Alternatively, models can take advantage of the additivity concept; among these, the group contribution (GC) method [25, 39–41] is probably the most popular. Despite the simplicity and success of these approaches, there are some important limitations. First of all, the definition of groups is ambiguous, which leads to different parameterizations provided by different authors [39]. Besides, the same formal group type can have varying properties depending on its neighborhood and intramolecular context. As an alternative, molecular dynamics simulations have been used for determining HSP values [16, 42–44], even in systems as complex as polymers. Interestingly, quantum-chemical computations have rarely been used for predicting HSP. However, the method combining COSMO-RS sigma moments with the artificial neural network (ANN) methodology [45] deserves special attention. Notably, much better results were obtained using ANN than using a linear combination of sigma moments [45].

The application of nonlinear models is a promising way of modeling HSP. In recent times, there has been a significant growth of interest in developing QSPR/QSAR models utilizing nonlinear methodologies, such as support vector machine [46–50] and ANN [51–55] algorithms. The attractiveness of these methods lies in their universality and accuracy. However, many are characterized by complex architectures and nonanalytical solutions. An interesting exception is the multivariate adaptive regression splines (MARSplines) method [56]. It has been applied to several QSPR and QSAR problems, including crystallinity [57], inhibitory activity [58, 59], antitumor activity [60], antiplasmodial activity [61], retention indices [62], bioconcentration factors [63], and blood-brain barrier passage [64]. Interestingly, some studies suggested a higher accuracy of MARSplines when compared to ANN [57, 58, 65]. Another interesting approach is the combination of MARSplines with other regression methods. As shown in the research on blood-brain barrier passage modeling, the combination of MARSplines with stepwise partial least squares (PLS) or multiple linear regression (MLR) gave better results than the pure models [64].

The MARSplines model for a dependent (outcome) variable *y* and *M* + 1 terms (including the intercept) can be summarized by the following equation:

$$y = F_{0} + \sum_{m=1}^{M} F_{m} H_{km}\left(x_{v(k,m)}\right) \tag{2}$$

where the summation is over the *M* terms in the model, while *F*_{0} and *F*_{m} are the model parameters. The input variables of the model are the predictors *x*_{v(k,m)} (the *k*th predictor of the *m*th product). The function *H* is defined as a product of basis functions (*h*):

$$H_{km}\left(x_{v(k,m)}\right) = \prod_{k=1}^{K} h_{km} \tag{3}$$

where *h* represents two-sided truncated functions of the predictors, hinged at points termed knots. Each knot *t* splits the range of a predictor *x* into two regions, in which the basis function takes the form (*t* − *x*) or (*x* − *t*); outside its region, the respective function is set to zero. The values of the knots are determined from the modeled data.
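The structure of such a model can be illustrated with a short sketch. The Python code below is not the implementation used in this work; it only evaluates an already fitted model of the kind described above (an intercept plus a sum of coefficients times products of truncated linear basis functions). The function names, the `(coefficient, factors)` term layout, and all numerical values are hypothetical and chosen purely for illustration; knot selection and pruning are not shown.

```python
def hinge(x, knot, sign=1):
    """Truncated linear basis function.

    sign=+1 gives max(0, x - knot); sign=-1 gives max(0, knot - x).
    """
    return max(0.0, sign * (x - knot))


def mars_predict(row, intercept, terms):
    """Evaluate a MARS-type model for one observation.

    row       : sequence of predictor values
    intercept : the constant term (F_0 in the text)
    terms     : list of (coefficient, factors), where factors is a list of
                (predictor_index, knot, sign) tuples whose hinge values
                are multiplied together
    """
    y = float(intercept)
    for coefficient, factors in terms:
        product = 1.0
        for index, knot, sign in factors:
            product *= hinge(row[index], knot, sign)
        y += coefficient * product
    return y


# Hypothetical two-term model: one single-predictor term and one interaction.
example_terms = [
    (2.0, [(0, 2.0, +1)]),                 # 2.0 * max(0, x0 - 2)
    (0.5, [(0, 0.0, +1), (1, 1.0, -1)]),   # 0.5 * max(0, x0) * max(0, 1 - x1)
]
prediction = mars_predict([3.0, 0.0], intercept=1.0, terms=example_terms)
```

Because every term is an explicit product of piecewise-linear functions, the fitted model remains an analytical expression, which is the property that distinguishes this approach from most other nonlinear methods mentioned above.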

Since nonparametric models are usually adaptive and highly flexible, they can very often overfit the data. This can lead to poor performance on new observations, even when predictions for the training data are excellent. Such an inherent lack of generalization is also characteristic of the MARSplines approach. Hence, in addition to the pruning technique used for limiting the complexity of the obtained model by reducing the number of basis functions, it is also necessary to augment the analysis with the physical meaning of the obtained solutions.

The purpose of this study is to test the applicability of the MARSplines approach for determining Hansen solubility parameters and to verify the usefulness of the obtained models by solubility predictions. Hence, an in-depth exploration was performed, including resizing of the models combined with a normalization and orthogonalization of both factors and descriptors. Also, a comparison with the traditional multivariable regression QSPR approach was undertaken. Finally, the obtained models were used for solving typical tasks for which Hansen solubility parameters can be applied, in order to document their reliability and applicability.

#### 2. Methods

##### 2.1. Data Set and Descriptors

In this paper, the data set of experimental HSP collected by Járvás et al. [45] was used for generating the QSPR models. This diverse collection comprises a wide range of nonpolar, polar, and ionic compounds including hydrocarbons (e.g., hexane, benzene, toluene, and styrene), alcohols (e.g., methanol, 2-methyl-2-propanol, glycerol, sorbitol, and benzyl alcohol), aldehydes and ketones (e.g., benzaldehyde, butanone, methyl isoamyl ketone, and diisobutyl ketone), carboxylic acids (e.g., acetic acid, acrylic acid, benzoic acid, and citric acid), esters (e.g., isoamyl acetate, propylene carbonate, and butyl lactate), amides (e.g., *N*,*N*-dimethylformamide, formamide, and niacinamide), halogenated hydrocarbons (e.g., dichloromethane, 1-chlorobutane, chlorobenzene, and 1-bromonaphthalene), ionic liquids, and salts (e.g., [bmim]PF6, [bmim]Cl, and sodium salts of benzoic acid, *p*-aminobenzoic acid, and diclofenac). These data were obtained from the original HSP database [39, 66] and several other reports [67, 68]. After removing duplicate entries from the original collection, a set of 130 compounds for which experimental HSP data are available was used.

Using the information encoded in canonical SMILES, the PaDEL software [69] offers 1444 descriptors of both 1D and 2D types. Not all of them can be used in modeling: descriptors that could not be computed for all compounds, or that had zero variance, were excluded from further analysis. The remaining 886 descriptors were used for model definition.
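The filtering step described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure actually used: the descriptor table is assumed to be a plain dictionary mapping a descriptor name to a list of values (one per compound), missing or non-numeric entries are assumed to mark descriptors that could not be computed, and the function name `filter_descriptors` is hypothetical.

```python
import math


def filter_descriptors(table):
    """Keep descriptors that are computable for all compounds and not constant.

    table: dict mapping descriptor name -> list of raw values per compound.
    Returns a dict with the same layout containing only the retained
    descriptors, converted to floats.
    """
    kept = {}
    for name, values in table.items():
        try:
            numbers = [float(v) for v in values]
        except (TypeError, ValueError):
            continue  # at least one compound has no usable value
        if not numbers or not all(math.isfinite(v) for v in numbers):
            continue  # empty column, NaN, or infinite entries
        if max(numbers) == min(numbers):
            continue  # zero variance: identical value for every compound
        kept[name] = numbers
    return kept


# Hypothetical descriptor names and values, for illustration only.
raw = {
    "desc_a": [1, 2, 3],
    "desc_const": [1, 1, 1],
    "desc_missing": [1.0, None, 2.0],
    "desc_text": ["1", "x", "3"],
    "desc_nan": [0.5, float("nan"), 0.1],
}
usable = filter_descriptors(raw)
```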

##### 2.2. Computational Protocol

Model building was conducted using either the absolute values of the descriptors or orthogonalized data. Since there are different criteria for selecting independent variables from a pool of mutually related ones, two specific criteria were applied. The first relied on the direct correlation with the modeled HSP data, provided that *R*^{2} > 0.01. The second used the ranking offered by Statistica [70], tailored for regression analysis. Descriptors were considered nonorthogonal if their mutual Spearman correlation coefficient was higher than 0.7 (*R*^{2} > 0.49). These different methods of orthogonalization led to different sets of descriptors used in the QSPR and MARSplines approaches. The types of computations performed are summarized in Scheme 1.
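One possible reading of the first selection criterion is sketched below. It is an interpretation and not the authors' implementation: descriptors are ranked by the squared Pearson correlation with the target property, those with *R*^{2} ≤ 0.01 are discarded, and a descriptor is then kept only if its absolute Spearman correlation with every already retained descriptor does not exceed 0.7. The greedy order, the use of Pearson *R*^{2} for the ranking, and all function names are assumptions; the Statistica-based ranking is not reproduced here.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def select_descriptors(table, target, min_r2=0.01, max_rho=0.7):
    """Greedy selection of mutually weakly correlated descriptors."""
    scored = [(pearson(v, target) ** 2, name) for name, v in table.items()]
    candidates = sorted((s, n) for s, n in scored if s > min_r2)
    candidates.reverse()  # best-correlated descriptor first
    kept = []
    for _, name in candidates:
        if all(abs(spearman(table[name], table[k])) <= max_rho for k in kept):
            kept.append(name)
    return kept


# Hypothetical data, for illustration only.
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
descriptors = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 7.0],
    "b": [1.0, 4.0, 9.0, 16.0, 25.0, 36.0],  # monotonic in "a": redundant
    "c": [2.0, 1.0, 2.0, 1.0, 2.0, 1.0],
    "d": [3.0, 3.0, 3.0, 3.0, 3.0, 3.0],     # constant: no correlation
}
selected = select_descriptors(descriptors, target)
```

In this toy example, descriptor "b" is dropped because its rank correlation with the better-scoring descriptor "a" is perfect, which mirrors the idea of retaining only one representative from each group of mutually related descriptors.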