Nuclear receptors (NRs) are important biological macromolecular transcription factors that are implicated in multiple biological pathways and may interact with xenobiotics, such as endocrine disruptors, present in the environment. Examples of important NRs include the androgen receptor (AR), estrogen receptors (ER), and the pregnane X receptor (PXR). In this study we have utilized the Ligand Activity by Surface Similarity Order (LASSO) method, a ligand-based virtual screening strategy, to derive structural (surface/shape) molecular features used to generate predictive models of biomolecular activity for AR, ER, and PXR. For PXR, twenty-five models were built using between 8 and 128 agonists and tested using 3000, 8000, and 24,000 drug-like decoys including PXR inactive compounds. Preliminary studies with AR and ER using LASSO suggested the utility of this approach with 2-fold enrichment factors at 20%. We found that models with 64–128 PXR actives provided enrichment factors of 10-fold (10% actives in the top 1% of compounds screened). The LASSO models for AR and ER have been deployed in ChemSpider and are freely available online, where they provide a ligand-based prediction of putative NR activity for compounds in the database.

1. Introduction

The nuclear receptor (NR) family of transcription factors is an important target for therapeutic interventions for multiple diseases [1] and also may interact with xenobiotics, such as endocrine disruptors, present in the environment [2]. It is therefore important to identify compounds that may specifically bind NRs and act as endocrine disruptors and to develop synthetic compounds that can selectively (in a cell-type and/or tissue-selective manner) modulate NR pharmacology (reviewed in [3–9]). NRs including the androgen receptor (AR; NR3C4), estrogen receptors α and β (ERα and ERβ; NR3A1 and NR3A2), and pregnane X receptor (PXR; NR1I2) are particularly important both as therapeutic targets and as mediators of the off-target effects of xenobiotics.

The ERs are activated by 17β-estradiol, while the AR is activated by testosterone and dihydrotestosterone, and these receptors are transcriptional regulators of many genes [10] with important physiological functions [11–16]. The human PXR [17–19] similarly transcriptionally regulates genes involved in xenobiotic metabolism and excretion, as well as other cellular processes, including apoptosis [20–24]. Human PXR is a broad specificity NR, binding a wide variety of molecules [25], and the activation of this NR can cause drug-drug interactions [23].

Multiple QSAR and machine learning models have been described for these NRs to address endocrine disruptor risk assessment [26–28] and toxicological screening [29]. For example, a recent QSAR analysis of 74 natural or synthetic estrogens provided information on structural features for the activation of ERα and ERβ [30]. Nonlinear statistical machine learning methods have been applied to separate NR activators from nonactivators [31]. A virtual screening protocol identified ERβ specific ligands from a plant product-based database [32]; from 12 candidates evaluated by a fluorescence polarization binding assay, 3 had >100-fold selectivity to ERβ over ERα. The same approach has also been used to find compounds with good selectivity for ERα over ERβ [32, 33]. Bisson et al. have used computational methods that led to a nonsteroidal antiandrogen with improved AR antagonistic activity based on an initial screening of FDA approved new drugs [33]. Several groups have published datasets or performed modeling on ER and AR, and these data are readily available for further evaluation with new modeling methods [34–38].

While the crystal structures of human PXR [39–43] have led to a greater understanding of the ligand binding domain (LBD) and ligand-receptor interactions [39–45], ligand-based computational models possess the key features for predicting binding [46–49]. PXR pharmacophores have been used to predict interactions for antibiotics [50] verified in vitro, and machine learning methods have also been evaluated [25, 38, 51–53]. Several protein-based docking studies have also been used to predict PXR agonists [25, 54–56], although machine learning methods appear to be advantageous to date.

We have recently described troubleshooting various computational methods [57] and specifically compared different methods for PXR [25]. There is a continuous search for new methods that might offer advantages for computational modeling to overcome some of these limitations, specifically for NRs [58]. A ligand-based software called LASSO (Ligand Activity by Surface Similarity Order) has been described that is focused on similarities in biomolecular activity rather than structural similarity [59]. The key components of LASSO are the 23 kinds of Interacting Surface Point Type (ISPT) molecular descriptors (see Supplemental Table 1 available online at http://dx.doi.org/10.1155/2013/513537). These capture the essence of the surface point information for a ligand in a feature vector containing the count of each surface point type. This vector serves as the descriptor of that molecule, with the assumption that ligands with similar feature vectors will have similar activity. A key property of the LASSO descriptor is its conformation independence, which is due to the fact that it is defined by the number and type of Interacting Surface Points and not by their relative spatial distribution. LASSO has been shown to be able to readily screen over 1 million structures/minute, identify active molecules by enriching screened databases, and provide a means for scaffold hopping [59]. The current study applies LASSO to various NR datasets to generate models, validate them, and make the models available on a public website to illustrate how the method can be used. This work can be considered an extension of our previous troubleshooting studies [25, 57].
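To illustrate why a descriptor built only from counts is conformation independent, the following minimal Python sketch builds a 23-element count vector from a list of typed surface points. The type indices, the coordinates, and the function name `ispt_count_vector` are hypothetical and chosen only for illustration; the actual surface point types are those listed in Supplemental Table 1, and this is not the LASSO implementation.

```python
from collections import Counter

N_TYPES = 23  # number of ISPT kinds stated in the text


def ispt_count_vector(surface_points, n_types=N_TYPES):
    """Return per-type counts for a list of (type_index, (x, y, z)) points.

    Coordinates are deliberately ignored, so rearranging the same typed
    points in space (a different conformation) gives the same vector.
    """
    counts = Counter(type_index for type_index, _xyz in surface_points)
    unknown = [t for t in counts if not 0 <= t < n_types]
    if unknown:
        raise ValueError(f"unknown surface point types: {unknown}")
    return [counts.get(t, 0) for t in range(n_types)]


# Two hypothetical conformers: same typed points, different coordinates.
conformer_a = [(0, (0.0, 0.0, 0.0)), (0, (1.2, 0.0, 0.0)), (5, (0.0, 2.1, 0.3))]
conformer_b = [(5, (4.0, 1.0, 0.0)), (0, (0.5, 0.5, 0.5)), (0, (2.0, 2.0, 2.0))]

vec_a = ispt_count_vector(conformer_a)
vec_b = ispt_count_vector(conformer_b)
print(vec_a == vec_b)
```

Because only counts enter the vector, two ligands with very different shapes can also share a descriptor, which is consistent with the false-positive and scaffold-hopping behavior discussed later in the paper.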

2. Materials and Methods

2.1. Training and Test Set Molecular Dataset Selection

One goal of this study was to determine what level of enrichment for binders (at a weak or strong binding threshold) can be achieved using the ChemSpider LASSO descriptors (a ligand-based approach) and to compare it with the enrichment from a structure-based docking approach (eHiTS). The AR dataset consisted of 203 molecules with relative binding affinities and activity threshold classes of (a) strong, (b) moderate, (c) weak, and (d) inactive/nonbinding ligands. For this dataset we evaluated the ability of LASSO to identify both (a) strong binders and (a + b) strong and moderate binders (all others were considered to be nonbinding). The training set for AR, derived with the LASSO descriptors, was obtained from the DUD set [60] and differs considerably from the test set. The ER dataset consisted of 50 molecules with 15 “hits” (i.e., considerably weak binders) and 35 “nonhits” for estrogen receptor binding. This test set also differs considerably from its training set, for which we again used DUD, namely, the DUD ER sets (default or agonist and antagonist) [60].

In addition to the ChemSpider LASSO approach for the AR test set we used the eHiTS structure-based (molecular docking) screening strategy on two conformations of the AR (using PDB structures 2AMA and 1XNN) and reported the minimum score across the two conformations examined (this approach was used to add flexibility to the receptor). Similarly, for the ER dataset we docked against two functionally distinct conformations of the estrogen receptor (3ERT and 1GWR).
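The minimum-score rule across receptor conformations can be sketched as follows. The ligand names and score values are hypothetical, the function name `ensemble_min_score` is illustrative, and the sketch assumes that a lower (more negative) docking score indicates a better predicted binder; it does not call eHiTS itself.

```python
def ensemble_min_score(scores_by_conformation):
    """Combine per-conformation docking scores by taking the minimum per ligand.

    scores_by_conformation maps a receptor structure ID to a dict of
    {ligand_name: score}. Only ligands scored in every conformation are kept.
    """
    per_conformation = list(scores_by_conformation.values())
    common = set(per_conformation[0]).intersection(*per_conformation[1:])
    return {
        ligand: min(scores[ligand] for scores in per_conformation)
        for ligand in sorted(common)
    }


# Hypothetical scores against the two AR structures named in the text.
scores = {
    "2AMA": {"lig1": -5.2, "lig2": -3.1},
    "1XNN": {"lig1": -4.8, "lig2": -6.0},
}
best = ensemble_min_score(scores)
print(best)
```

Taking the best score over several rigid receptor structures is a simple way to approximate receptor flexibility without modeling it explicitly.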

2.2. Datasets for LASSO Modeling: Structure File Preparation

The rat ER binding dataset ( values for 50 compounds of environmental relevance [35]) was obtained from EPA’s DSSTox database (http://epa.gov/ncct/dsstox/ [34]). This dataset contains 15 industrial chemical “binders” (i.e., nontherapeutic) with significantly weaker binding affinities than what would be desired for drug lead candidates (i.e., 3–5 orders of magnitude weaker binding affinity than the natural ligand 17β-estradiol). Similarly, the NCTR’s rat AR activity dataset (competitive inhibition assays), also used in this study, contains 146 AR binders and 56 nonbinders (http://www.fda.gov/nctr/science/centers/toxicoinformatics/edkb/index.htm [37, 38]). All structures were imported into MOE and geometry optimized using the MMFFx forcefield in MOE (Chemical Computing Group, Montreal, Canada).

Three human PXR datasets were used. Dataset 1 represented 80 actives μM and 64 inactives μM that were drug-like molecules. For each molecule identified by name or CAS number, the SMILES string was downloaded from either PubChem (http://pubchem.ncbi.nlm.nih.gov/) or ChemSpider (http://www.chemspider.com/), or the structure was sketched using the BUILDER module of SYBYL [56]. Dataset 2 represented 93 actives and 75 inactives that were drug-like molecules from a previous study [61]. The molecular structures encoded as SMILES strings [62] were downloaded from the supplementary information tables in the original publication [61]. Dataset 3 represented 30 actives and 89 inactives from a dataset of steroidal compounds (namely, androstanes, estratrienes, pregnanes, and bile salts) as well as the ligands used in the crystal structures [25]. For all datasets, human PXR activation was determined by a luciferase-based reporter assay, as previously described in these and other publications.

2.3. LASSO Models for ER and AR

The LASSO methodology has been described in detail previously [59]. Its performance, in terms of test set diversity and percent enrichment of a database, has been evaluated for the DUD set in that paper and for other targets elsewhere (http://www.simbiosys.com/ehits_lasso/ehits_lasso_table.html). Here we examine the performance of eHiTS LASSO with an endocrine panel subset of target proteins from the total of ~48 nuclear receptors. We used the newly assembled directory of useful decoys (DUD) [60] dataset to augment both the KIERBL and NCTR AR datasets.

2.4. PXR Models: Method I

The three PXR datasets described earlier were received from three different sources. Set 1, named “hpxr_test,” contained 80 actives and 64 inactives or decoys; set 2, named “hpxr_train,” contained 93 actives and 75 inactives; and set 3, named “PXR119-class,” contained 30 actives and 89 inactives. From these three datasets, 7 screening prediction models were built using only the actives (the inactives were automatically generated by the software).

The following models were developed. Model 1 was trained on the first dataset (hpxr_test, 80 ligands) and tested with the other two sets (123 ligands). Model 2 was trained on the second dataset (hpxr_train, 93 ligands) and tested with the other two sets (110 ligands). Model 3 was trained on the third dataset (PXR119-class, 30 ligands) and tested with the other two sets (173 ligands). Models 4–6 were trained on sets 1 and 2 (173 ligands), sets 1 and 3 (110 ligands), and sets 2 and 3 (123 ligands), respectively, and each was tested with the one remaining set of actives (i.e., 30, 93, and 80 ligands, resp.). Model 7 was trained on all actives (sets 1, 2, and 3) and tested on the same. This was done as an extreme case to see the maximum potential training effect.
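The seven train/test combinations and their ligand counts can be checked with a short enumeration. The sketch below uses only the dataset names and active counts given in the text; the function names `method_one_splits` and `size` are illustrative and not part of any LASSO tooling.

```python
from itertools import combinations

# Number of actives in each PXR dataset, as stated in the text.
ACTIVES = {"hpxr_test": 80, "hpxr_train": 93, "PXR119-class": 30}


def method_one_splits(actives=ACTIVES):
    """Return (train_sets, test_sets) pairs for Models 1-7 in order."""
    names = list(actives)
    splits = []
    for k in (1, 2):  # Models 1-3 train on one set, Models 4-6 on two sets
        for train in combinations(names, k):
            test = tuple(n for n in names if n not in train)
            splits.append((train, test))
    splits.append((tuple(names), tuple(names)))  # Model 7: train and test on all
    return splits


def size(names, actives=ACTIVES):
    """Total number of actives in the named datasets."""
    return sum(actives[n] for n in names)


for i, (train, test) in enumerate(method_one_splits(), start=1):
    print(f"Model {i}: train {size(train)} actives, test {size(test)} actives")
```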

2.5. PXR Models: Method II

A second method for creating LASSO prediction models for the PXR test case was also investigated. Actives from the 3 datasets were all merged, resulting in an SDF file with 203 ligands with relative binding affinities and activity threshold classes (a) strong, (b) moderate, (c) weak, and (d) inactive/nonbinding ligands. To determine how many actives need to be selected for a good LASSO prediction model, and whether the source of the actives is important, 25 LASSO models were developed (Supplementary Table 2). Prediction models were built by selecting 8, 16, 32, 64, and 128 actives, each starting from the first, ninth, seventeenth, thirty-third, and sixty-fourth positions in the merged actives file (5 training set sizes × 5 starting positions).
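The 25 training selections can be enumerated as in the sketch below. It assumes that each training set is a contiguous block of the merged actives file beginning at the stated 1-based position, which is one reading of the description; the actual selections are given in Supplementary Table 2, and the function name `method_two_models` is hypothetical.

```python
from itertools import product

SIZES = (8, 16, 32, 64, 128)        # training set sizes from the text
START_POSITIONS = (1, 9, 17, 33, 64)  # 1-based starting positions from the text
N_ACTIVES = 203                     # merged actives from the three datasets


def method_two_models(n_actives=N_ACTIVES, sizes=SIZES, starts=START_POSITIONS):
    """Return one dict per model with 0-based train and held-out test indices.

    Assumption: training actives form a contiguous block in file order.
    """
    models = []
    for size, start in product(sizes, starts):
        first = start - 1
        if first + size > n_actives:
            raise ValueError(f"block of {size} from position {start} overruns file")
        train = list(range(first, first + size))
        train_set = set(train)
        test = [i for i in range(n_actives) if i not in train_set]
        models.append({"size": size, "start": start, "train": train, "test": test})
    return models


models = method_two_models()
print(len(models))
```

Under this reading, the held-out actives for each model are simply the merged actives that were not used for training.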

The above 25 models were then tested for enrichment factor using the actives from the total active set, leaving out those used for training (8, 16, 32, 64, or 128 ligands, resp.), mixed with drug-like decoys obtained from another recent screening study [63]. To assess the effect of decoy set size on the prediction model, random subsets of 3000 (3 k) and 8000 (8 k) decoys and the whole 24,000 (24 k) decoy set were used. In each case the inactives (decoys) from all three PXR datasets received (228 structures in total) were added to the decoy test set.
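For reference, an enrichment factor at a given screened fraction can be computed from a ranked list as in the following sketch. The ranked list is hypothetical and was constructed so that 40% of the actives fall in the top 10% and 10% fall in the top 1%; it is not data from this study, and the function name `enrichment_factor` is illustrative.

```python
def enrichment_factor(ranked_is_active, fraction):
    """Enrichment factor in the top `fraction` of a ranked list.

    ranked_is_active: booleans ordered from best to worst score.
    EF = (actives found in top / size of top) / (all actives / all compounds).
    """
    n = len(ranked_is_active)
    n_actives = sum(ranked_is_active)
    if n == 0 or n_actives == 0:
        raise ValueError("need a non-empty list with at least one active")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_top = max(1, round(fraction * n))
    hits = sum(ranked_is_active[:n_top])
    return (hits * n) / (n_top * n_actives)


# Hypothetical ranking: 1000 compounds, 10 actives at the listed ranks.
active_ranks = {0, 20, 40, 60, 500, 501, 502, 503, 504, 505}
ranked = [i in active_ranks for i in range(1000)]

ef_10 = enrichment_factor(ranked, 0.10)  # 4 of 10 actives in the top 100
ef_1 = enrichment_factor(ranked, 0.01)   # 1 of 10 actives in the top 10
print(ef_10, ef_1)
```

Random ranking gives an enrichment factor of about 1, and the maximum attainable value at a screened fraction x is 1/x, so enrichment factors are only comparable at the same fraction.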

3. Results

The enrichment plots for AR (Figure 1) and ER (Figure 2), showing the percent of actives recovered versus the percent of the dataset screened, reveal an enrichment of ~2-fold at 20% dataset coverage regardless of whether a ligand-based or structure-based approach was used. For the AR dataset, different levels of enrichment are obtained with both ChemSpider LASSO and eHiTS depending on whether the interaction threshold is specified as strong or strong + moderate. Both the ligand-based and the structure-based screening approaches perform better at binning molecules with stronger interactions (cyan and purple) than when weaker binding classes are added (magenta and yellow). Interestingly, in terms of the early-recognition problem, eHiTS is more sensitive (4-fold at 20%) than LASSO (1.5-fold at 20%); however, all 15 actives are captured by the LASSO descriptor within the first 37% of the dataset (with a minimum value of LASSO = 0.07) at considerably lower computational cost. In a real scenario, this could be exploited by screening a specific dataset on ChemSpider for compounds with an AR LASSO value above a threshold (in this case 0.07) and following up only these “hits” with a more costly structure-based approach.
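The threshold-based triage described above can be sketched in a few lines. The compound names and scores are hypothetical, chosen so that the lowest-scoring active sits at 0.07 as in the text; the function names are illustrative, and no ChemSpider or eHiTS interface is called.

```python
def recall_threshold(scored_compounds, actives):
    """Lowest score among known actives: the cutoff that retains all of them."""
    return min(score for name, score in scored_compounds if name in actives)


def prefilter(scored_compounds, threshold):
    """Keep compounds whose ligand-based score is at or above the threshold."""
    return [name for name, score in scored_compounds if score >= threshold]


# Hypothetical ligand-based scores (higher means more likely active).
scored = [("c1", 0.91), ("c2", 0.40), ("c3", 0.07), ("c4", 0.03), ("c5", 0.01)]
known_actives = {"c1", "c3"}

cutoff = recall_threshold(scored, known_actives)
to_dock = prefilter(scored, cutoff)
fraction_kept = len(to_dock) / len(scored)
print(cutoff, to_dock, fraction_kept)
```

Only the compounds in `to_dock` would then be passed to the slower structure-based step, so the saving in docking time scales with the fraction of the database that falls below the cutoff.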

For the ER dataset, where all 15 binders are in fact weak binders (i.e., 3 to 5 orders of magnitude weaker than the natural ligand 17β-estradiol), the default LASSO descriptors outperform (3-fold enrichment) the structure-based approach (2-fold enrichment), and the agonist-trained LASSO method outperforms the antagonist LASSO method (most likely due to a larger diversity among antagonists than among agonists). Thus even for weakly interacting partners (i.e., low affinity binders for ER) we can still obtain enrichment that is substantially better than random.

Because ChemSpider is at its core a rich and diverse collection of chemical structures, these were used to produce LASSO predictions for over 14 million compounds against a series of 40 targets including AR and ER. A LASSO search feature was added to ChemSpider to allow users to search the database by LASSO value (see Figure 3(a)). Scientists can readily search for the top 1000 compounds (or fewer) with the highest LASSO value for a particular target of interest. An advanced search in ChemSpider can combine LASSO value searches with other parameters such as molecular weight, rule of 5 values, and specific data sources (e.g., selecting molecules from commercial data sources only).

These tandem virtual screening approaches apply computationally efficient ligand-based ChemSpider LASSO descriptors prior to more costly structure-based virtual screening strategies. Piggy-backing the structure-based strategies on top of such an initial screen dramatically improves the efficiency of virtual screening workflows and helps with the early-recognition problem.

We have also shown an example molecule, mibolerone, which LASSO predicts to be a strong AR and ER binder (Figure 3(b)) and which is known to be a potent AR binder [64]. The LASSO surface point type values are shown in Supplemental Table 3 and more visually in Supplemental Figure 2.

When we used LASSO models with hPXR in Method I (Figure 4), we found the best results with Model 1, which suggested that 40% of the ligands can be pushed into the top 10% of the screened database, resulting in an enrichment factor 4-fold better than random (Figure 5). In Method II we found the same enrichment factor using 64 actives in a 24 thousand compound decoy set (Figure 6). Another way to evaluate the models is to present the statistics for using dataset 1 to predict dataset 2, for which we obtained sensitivity 12%, specificity 99%, accuracy 51%, and Matthews correlation 0.2. Using dataset 2 to predict dataset 1 gives similar results. These results suggest that the models could identify potential human PXR agonists in databases, as has been shown for other target proteins [59].
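The four classification statistics follow from a confusion matrix as shown in the sketch below. The confusion-matrix counts are not reported in the paper; the values used here (11 true positives, 82 false negatives, 74 true negatives, 1 false positive) are an assumption, chosen because they sum to the 93 actives and 75 inactives of dataset 2 and reproduce the rounded statistics quoted above.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy, and Matthews correlation coefficient."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts consistent with 93 actives and 75 inactives (dataset 2).
metrics = classification_metrics(tp=11, fn=82, tn=74, fp=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With roughly balanced classes, a model that calls almost everything inactive reaches about 50% accuracy, which is why a high specificity combined with a low sensitivity yields only a modest Matthews correlation even though the top-ranked compounds are strongly enriched.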

4. Discussion

For both the AR and ER ligands the main objective was to see how ligand-based screening tools, such as LASSO’s ChemSpider implementation, perform such that they could be used for prioritizing chemicals for testing. The AR dataset contained a mixture of drug-like and environmental receptor modulators, whereas the ER dataset contained primarily environmental chemicals. The “actives” have relatively weak binding affinities, that is, of 10⁻⁴–10⁻⁶; while these would be poor candidates for lead optimization into drugs, they still pose an interaction potential with biological systems such as NRs if they bioaccumulate. Using these leads from LASSO screening with other methods such as molecular docking or free-energy perturbation simulations may also be useful. The validation of the approaches outlined above was pursued by examining two real datasets. These were the FDA’s NCTR AR dataset [38] and a set of 50 environmental molecules evaluated for ER binding affinity [35]. In addition we have used multiple sets of PXR agonists described previously. Our results show enrichments of between 2-fold and 4-fold depending on the NR. For PXR there have been numerous recent studies using different machine learning methods and descriptors [25, 54, 56], and while the Matthews correlation coefficient in this study is lower than those in previous studies, the level of enrichment, between 4-fold (40% of actives in the top 10% of compounds screened) and 10-fold (10% of actives in the top 1% of compounds screened), was very encouraging.

The molecular descriptors used in eHiTS LASSO are independent of ligand conformation and have been shown to successfully enrich screened databases across a wide range of target families [59]. Lying somewhere between a 2D and a 3D descriptor, the ISPT descriptor does not contain any shape or 2D connectivity information. There may, however, be some molecular size information implicit in the descriptor: because it captures the counts of surface points, larger molecules will have more surface points than smaller molecules (and eHiTS LASSO may be somewhat sensitive to this).

The relatively high speed of eHiTS LASSO on a single CPU [59] makes it an ideal tool to be used as a predocking screen. From a troubleshooting perspective, eHiTS LASSO will return a high percentage of false positives because it does not consider the 3D relationships of surface properties. For the same reason, it will also return a higher percentage of different scaffolds, enabling scaffold hopping. It is also important to note that LASSO cannot differentiate stereoisomers, apart from, perhaps, diastereomeric pairs, which have structurally (configurationally) different features rather than conformationally different features (to which the method is invariant).

Taking the results of eHiTS LASSO and feeding the top-ranked compounds into a docking program would allow the docking program to weed out many of the false-positive binders. For this reason, eHiTS LASSO is currently integrated with the commercially available eHiTS docking tool and can be readily used as a predocking screening tool for large virtual screens.

From the current study we have shown significant enrichments when testing computational models for AR, ER, and PXR. While AR and ER predictions are currently already implemented in ChemSpider, it is clear that adding predicted values for PXR and other NRs as they become available would be beneficial to the community in terms of accessing an open source of chemical structures with pregenerated descriptors. It should be noted however that the generation of model data for a database as large as that hosted by ChemSpider (now well over 25 million compounds) is not a small undertaking and consumes a significant amount of compute time, data preparation, and handling in order to deliver the models to the community for consumption.

The use of such ligand-based computational methods, as exemplified by LASSO in this study, could also be useful for the design and selection of chemical products that are less hazardous to human health and the environment. This may make them useful in green chemistry [65] (http://www.epa.gov/gcc/pubs/about_gc.html) as well as in biomedical research. The ready accessibility of such NR binding predictions from computational models like LASSO will be key in the future for both pharmaceutical and environmental applications, and databases like ChemSpider can have an important role in providing them to the public as a predocking criterion, as we have demonstrated in this study. This study used published rat ER, rat AR, and human PXR data. LASSO could also be applied to build models for the same NRs across multiple species, such that they could be used to estimate interspecies variation in ligand binding.


Abbreviations

AR: Androgen receptor
DUD: Directory of useful decoys
ER: Estrogen receptor
ISPT: Interacting Surface Point Type
LASSO: Ligand Activity by Surface Similarity Order
LBD: Ligand binding domain
PXR: Pregnane X receptor
QSAR: Quantitative Structure Activity Relationship
SNNS: Stuttgart Neural Network Simulator.

Supporting Information

The supplemental files contain (I) the 23 Surface Point Types used in LASSO with related descriptions, (II) the model building details for PXR, (III) the LASSO 6.1 surface point types for Mibolerone, (IV) a visualization of the generalized surface-point types from LASSO for a histidine-like fragment as visualized in CheVi, and (V) Mibolerone displayed in SimBioSys’ CheVi 3D desktop visualization tool, showing the 3D structure, color-coded interaction surface of the molecule, and the surface point representation.


This document has been subjected to review by the US Environmental Protection Agency and approved for publication.

Conflict of Interests

A. J. Williams is employed by the Royal Society of Chemistry which owns ChemSpider and associated technologies. S. Ekins and M. R. Goldsmith were on the advisory board for ChemSpider from June 2007 until May 2011. A. Simon, Z. Zsoldos, and O. Ravitz are employed by SimBioSys Inc. which owns LASSO and eHiTS.
