This monograph focuses on chemoinformatics approaches applicable to virtual screening of very large available collections of chemical compounds to identify novel biologically active molecules. The approaches covered in the book rely on cheminformatics concepts such as representation of molecules using multiple descriptors of chemical structures, advanced chemical similarity calculations in multidimensional descriptor spaces, and machine learning and data mining approaches. The focus on extending the experiences accumulated in traditional areas of cheminformatics research such as Quantitative Structure Activity Relationships (QSAR) or chemical similarity searching towards virtual screening makes this monograph essential reading for researchers in the area of computer-aided drug discovery.
Publisher: Royal Society of Chemistry, The
Product dimensions: 6.30(w) x 9.30(h) x 1.00(d)
About the Author
Alexandre Varnek is Professor in Theoretical Chemistry at the Louis Pasteur University (ULP), France, Head of the Laboratory of Chemoinformatics, and Director of Master Courses on Chemoinformatics at the Faculty of Chemistry, ULP. He has 30 years' experience in the fields of molecular modelling and chemoinformatics and more than 80 publications, including a monograph. His current research projects include the development of new approaches and software tools for in silico design of new compounds. Alexander Tropsha is Head of the Laboratory for Molecular Modeling, School of Pharmacy at the University of North Carolina, Chapel Hill, USA, as well as Professor and Chair, Division of Medicinal Chemistry and Natural Products at the School of Pharmacy. His research interests include Computer-Aided Drug Design, Chemoinformatics, and Structural Bioinformatics. He has authored or co-authored over 110 peer-reviewed research papers and book chapters.
Read an Excerpt
Chemoinformatics Approaches to Virtual Screening
By Alexandre Varnek, Alex Tropsha
The Royal Society of Chemistry
Copyright © 2008 Royal Society of Chemistry
All rights reserved.
Fragment Descriptors in SAR/QSAR/QSPR Studies, Molecular Similarity Analysis and in Virtual Screening
IGOR BASKIN AND ALEXANDRE VARNEK
Chemoinformatics is an emerging science that concerns the mixing of chemical information resources to transform data into information, and information into knowledge. It is a branch of theoretical chemistry based on its own molecular model, and it uses its own basic concepts, learning approaches and areas of application. Unlike quantum chemistry, which considers molecules as an ensemble of electrons and nuclei, or force field molecular mechanics or dynamics simulations based on a classical molecular model ("atoms" and "bonds"), chemoinformatics represents molecules as objects in a chemical space defined by molecular descriptors. Among thousands of descriptors, fragment descriptors occupy a special place. Fragment descriptors represent selected subgraphs of a 2D molecular graph; structure-property approaches use their occurrences in molecules or binary values (0, 1) to indicate their presence or absence in the given graph.
The unique properties of fragment descriptors are related to the fact that (i) any molecular graph invariant (i.e., any molecular descriptor or property) can be uniquely represented as a linear combination of fragment descriptors; (ii) any symmetric similarity measure can be uniquely expressed in terms of fragment descriptors; and (iii) any regression or classification structure–property model can be represented as a linear equation involving fragment descriptors.
An important advantage of fragment descriptors is related to the simplicity of their calculation, storage and interpretation (see review articles). They belong to information-based descriptors, which tend to code the information stored in molecular structures. This contrasts with knowledge-based (or semi-empirical) descriptors derived from consideration of the mechanism of action. Owing to their versatility, fragment descriptors can efficiently be used to build structure–property models, perform similarity search, virtual screening and in silico design of chemical compounds with desired properties.
This chapter reviews fragment descriptors with respect to their use in structure–property studies, similarity search and virtual screening. After a short historical survey, different types of fragment descriptors are considered thoroughly. This is followed by a brief review of the application of fragment descriptors in virtual screening, focusing mostly on filtering, similarity search and direct activity/property assessment using quantitative structure–property models.
1.2 Historical Survey
Among a multitude of descriptors currently used in Structure-Activity Relationships/Quantitative Structure–Activity Relationships/Quantitative Structure-Property Relationships (SAR/QSAR/QSPR) studies, fragment descriptors occupy a special place. Their application as atom and bond increments in the framework of additive schemes can be traced back to the 1930s–1950s; Vogel, Zahn, Souder, Franklin, Tatevskii, Bernstein, Laidler, Benson and Buss, and Allen pioneered this field. Smolenskii was one of the first, in 1964, to apply graph theory to the prediction of the physicochemical properties of organic compounds. Later on, these first additive scheme approaches gradually evolved into group contribution methods. The latter are closely linked with thermodynamic approaches and, therefore, are applicable only to a limited number of properties.
The epoch of QSAR (Quantitative Structure-Activity Relationships) studies began in 1963–1964 with two seminal approaches: the σ-ρ-π analysis of Hansch and Fujita and the Free-Wilson method. The former approach involves three types of descriptors related to electronic, steric and hydrophobic characteristics of substituents, whereas the latter considers the substituents themselves as descriptors. Both approaches are confined to strictly congeneric series of compounds. The Free–Wilson method additionally requires all types of substituents to be sufficiently present in the training set. A combination of these two approaches has led to QSAR models involving indicator variables, which indicate the presence of some structural fragments in molecules.
The non-quantitative SAR (Structure–Activity Relationships) models developed in the 1970s by Hiller, Golender and Rosenblit, Piruzyan, Avidon et al., Cramer, Brugger, Stuper and Jurs, and Hodes et al. were inspired by the artificial intelligence, expert systems, machine learning and pattern recognition paradigms that were popular at that time. In those approaches, chemical structures were described by means of indicators of the presence of structural fragments interpreted as topological (or 2D) pharmacophores (biophores, toxophores, etc.) or topological pharmacophobes (biophobes, toxophobes, etc.). Chemical compounds were then classified as active or inactive with respect to certain types of biological activity.
Methodologies based on fragment descriptors in QSAR/QSPR studies are not strictly confined to particular types of properties or compounds. In the 1970s Adamson and coworkers were the first to apply fragment descriptors in multiple linear regression analysis to find correlations with some biological activities, physicochemical properties, and reactivity.
An important class of fragment descriptors, the so-called screens (or structural keys, fingerprints), was also developed in the 1970s. As a rule, they are bit strings that can be stored and processed efficiently by computers. Although their primary role is to provide efficient substructure searching in large chemical structure databases, they can also be used efficiently for similarity searching, clustering large chemical databases and assessing their diversity, as well as for SAR and QSAR modeling.
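To illustrate why bit-string screens are cheap to process, the following minimal sketch stores a fingerprint as a Python integer. The bit positions and example values are hypothetical, and the Tanimoto coefficient is used only as one common similarity measure for bit strings; the text above does not name a specific measure.

```python
def passes_screen(query_bits, molecule_bits):
    """True if every fragment bit set in the query is also set in the molecule.

    A molecule that fails this test cannot contain the query substructure,
    so the expensive atom-by-atom match can be skipped for it.
    """
    return (query_bits & molecule_bits) == query_bits


def tanimoto(a, b):
    """Tanimoto coefficient: shared set bits divided by bits set in either."""
    either = bin(a | b).count("1")
    if either == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return bin(a & b).count("1") / either


# Hypothetical 6-bit fingerprints; each bit stands for one structural fragment.
molecule = 0b101101
print(passes_screen(0b001001, molecule))  # True
print(passes_screen(0b010001, molecule))  # False
print(tanimoto(molecule, 0b001111))       # 0.6
```

Passing the screen is a necessary but not sufficient condition for a substructure match, which is why screens are typically used as a fast pre-filter.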
Another important contribution was made in 1980 by Cramer who invented BC(DEF) parameters obtained by means of factor analysis of the physical properties of 114 organic liquids. These parameters correlate strongly with various physical properties of diverse liquid organic compounds. On the other hand, they could be estimated by linear additive-constitutive models involving fragment descriptors. Thus, a set of QSPR models encompassing numerous physical properties of diverse organic compounds has been developed using only fragment descriptors.
One of the most important developments of the 1980s was the CASE (Computer-Automated Structure Evaluation) program by Klopman et al. This "self-learning artificial intelligent system" can recognize activating and deactivating fragments (biophores and biophobes) with respect to the given biological activity and use this information to determine the probability that a test chemical is active. This methodology has been successfully applied to predict various types of biological activity: mutagenicity, carcinogenicity, hallucinogenic activity, anticonvulsant activity, inhibitory activity with respect to sparteine monooxygenase, β-adrenergic activity, μ-receptor binding (opiate) activity, antibacterial activity, antileukemic activity, etc. Using the multivariate regression technique, CASE can also build quantitative models involving fragment descriptors.
Starting in the early 1990s, various approaches and related software tools based on fragment descriptors have been developed and are listed in several conceptual and mini-review papers. Because of the wide scope and large variety of different approaches and applications in this field, many important ideas were reinvented many times and continue to be reinvented. In this review we try to present a clear state-of-the-art picture in this area.
1.3 Main Characteristics of Fragment Descriptors
In this section different types of fragments are classified with respect to their topology and the level of abstraction of molecular graphs.
1.3.1 Types of Fragments
A tremendous number of various fragments are used in structure-property studies: atoms, bonds, "topological torsions", chains, cycles, atom- and bond-centered fragments, maximum common substructures, line notation (WLN and SMILES) fragments, atom pairs and topological multiplets, substituents and molecular frameworks, basic subgraphs, etc. Their detailed description is given below.
Depending on the application area, two types of values taken by fragment descriptors are considered: binary and integer. Binary values indicate the presence (true, yes, 1) or the absence (false, no, 0) of a given fragment in a structure. They are usually used as screens and elements of fingerprints for chemical database management and virtual screening using similarity-based approaches as well as in SAR studies. Integer values corresponding to the occurrences of fragments in structures are used in QSAR/QSPR modeling.
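The distinction between integer and binary fragment descriptors can be sketched in a few lines. This is only an illustration: the fragment labels, the vocabulary and the function name `fragment_descriptors` are hypothetical, and the fragments are assumed to have been enumerated already by some other tool.

```python
from collections import Counter


def fragment_descriptors(fragments, vocabulary):
    """Return (counts, bits) for one molecule.

    fragments  -- fragment labels found in the molecule, with repeats
    vocabulary -- ordered fragment labels that define the descriptor columns
    """
    occurrences = Counter(fragments)
    counts = [occurrences[f] for f in vocabulary]  # integer values (QSAR/QSPR)
    bits = [1 if c > 0 else 0 for c in counts]     # binary values (screens, SAR)
    return counts, bits


# Hypothetical atom and bond fragments of ethanol, CH3-CH2-OH (hydrogens omitted).
ethanol = ["C", "C", "O", "C-C", "C-O"]
vocab = ["C", "N", "O", "C-C", "C-O", "C=O"]
counts, bits = fragment_descriptors(ethanol, vocab)
print(counts)  # [2, 0, 1, 1, 1, 0]
print(bits)    # [1, 0, 1, 1, 1, 0]
```

The binary vector discards how often a fragment occurs, which is acceptable for screening but loses information that additive QSPR models rely on.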
1.3.1.1 Simple Fixed Types
Disconnected atoms represent the simplest type of fragments. They are used to assess a chemical or biological property P in the framework of an additive scheme based on atomic contributions:
$$P = \sum_{i} n_i A_i \qquad (1.1)$$
where $n_i$ is the number of atoms of type $i$ and $A_i$ is the corresponding atomic contribution. Usually, the atom types account not only for the type of chemical element but also for hybridization, the number of attached hydrogen atoms (for heavy elements), occurrence in some groups or aromatic systems, etc. Nowadays, atom-based methods are used to predict some physicochemical properties and biological activities. Thus, several works have been devoted to assessing the octanol–water partition coefficient log P: the ALOGP method by Ghose and Crippen, later modified by Ghose and co-workers and by Wildman and Crippen, the CHEMICALC-2 method by Suzuki and Kudo, the SMILOGP program by Convard and co-authors, and the XLOGP method by Wang and co-authors. Hou and co-authors used Equation (1.1) to calculate aqueous solubility. The ability of this approach to assess biological activities was demonstrated by Winkler et al.
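A minimal sketch of such an atom-based additive scheme follows, assuming the form P = Σ n_i · A_i implied by the definitions of the atom counts and atomic contributions above. The atom-type names and the contribution values are invented for illustration and are not taken from ALOGP, XLOGP or any other published parameter set.

```python
def additive_property(atom_counts, contributions):
    """Evaluate P = sum_i n_i * A_i over all atom types present."""
    return sum(n * contributions[atom_type] for atom_type, n in atom_counts.items())


# Hypothetical contributions A_i; NOT values from any published scheme.
A = {"C.sp3": 0.5, "O.hydroxyl": -1.0, "H": 0.25}

# Ethanol, CH3-CH2-OH: 2 sp3 carbons, 1 hydroxyl oxygen, 6 hydrogens.
n = {"C.sp3": 2, "O.hydroxyl": 1, "H": 6}

P = additive_property(n, A)
print(P)  # 1.5
```

In practice the contributions are fitted to experimental data, for example by least-squares regression on the atom-count matrix of a training set.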
Chemical bonds are another type of simple fragment. The first bond-based additive schemes, such as those of Zahn, Bernstein and Allen, appeared almost simultaneously with the atom-based ones and dealt, presumably, with predictions of some thermodynamic properties.
"Topological torsions", invented by Nilakantan et al., are defined as linear sequences of four consecutively bonded non-hydrogen atoms. Each atom is described by the type of the corresponding chemical element, the number of attached non-hydrogen atoms and the number of π-electron pairs. Molecular descriptors indicating the presence or absence of topological torsions in chemical structures have been used to perform qualitative predictions of biological activity in structure-activity (SAR) studies. Later on, Kearsley et al. recognized that characterizing atoms by element type can be too specific for similarity searching and, therefore, does not provide sufficient flexibility for large-scale virtual screening. To solve this problem, they suggested assigning atoms in Carhart's atom pairs and Nilakantan's topological torsions to one of seven classes: cations, anions, neutral hydrogen bond donors, neutral hydrogen bond acceptors, polar atoms, hydrophobic atoms and other.
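The definition of a topological torsion as a path of four consecutively bonded non-hydrogen atoms can be sketched on a plain graph representation. This is a simplified illustration, not the published implementation: atoms are typed only by element and number of heavy-atom neighbours (the electron-pair term is omitted), and the input format (`elements`, `bonds`) is an assumption of this example.

```python
def topological_torsions(elements, bonds):
    """Enumerate paths of four consecutively bonded heavy atoms.

    elements -- element symbols of the heavy atoms (hydrogens omitted)
    bonds    -- (i, j) index pairs, each bond listed once
    Returns a sorted list of torsion labels, one per distinct path.
    """
    neighbours = {i: set() for i in range(len(elements))}
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)

    def label(i):
        # Simplified atom type: element plus number of heavy-atom neighbours.
        return f"{elements[i]}{len(neighbours[i])}"

    torsions = []
    for b, c in bonds:                       # b-c is the central bond
        for a in neighbours[b] - {c}:
            for d in neighbours[c] - {b, a}:
                forward = [label(x) for x in (a, b, c, d)]
                # A path and its reverse describe the same fragment.
                torsions.append("-".join(min(forward, forward[::-1])))
    return sorted(torsions)


# 1-Butanol heavy-atom skeleton: C0-C1-C2-C3-O4
butanol = topological_torsions(list("CCCCO"), [(0, 1), (1, 2), (2, 3), (3, 4)])
print(butanol)  # ['C1-C2-C2-C2', 'C2-C2-C2-O1']
```

Replacing the `label` function with a coarser class assignment (for example donor, acceptor, hydrophobic) would give the kind of generalized typing attributed above to Kearsley et al.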
The above-mentioned structural fragments (atoms, bonds and topological torsions) can be regarded as chains of different lengths. Smolenskii suggested using the occurrences of chains in an additive scheme to predict the formation enthalpy of alkanes. For the last four decades, chain fragments have proved to be one of the most popular and useful types of fragment descriptors in QSPR/QSAR/SAR studies. Fragment descriptors based on enumerating chains in molecular graphs are efficiently used in many popular structure-property and structure-activity programs: CASE and MULTICASE (MultiCASE, MCASE) by Klopman, NASAWIN by Baskin et al., BIBIGON by Kumskov, TRAIL and ISIDA by Solov'ev and Varnek. "Molecular pathways" by Gakh and co-authors, and "molecular walks" by Rücker, represent chains of atoms.
In contrast to chains, cyclic and polycyclic fragments are relatively rarely applied as descriptors in QSAR/QSPR studies. Nevertheless, implicitly cyclicity is accounted for by means of: (i) introducing special "cyclic" and "aromatic" types of atoms and bonds, (ii) "collapsing" the whole cycles and even polycyclic systems into "pharmacophoric" pseudo-atoms and (iii) generating cyclic fragments as a part of large fragments [Maximum Common Substructure (MCS), molecular framework, substituents]. Besides, the cyclic fragments are widely used as screens for chemical database processing.
1.3.1.2 WLN and SMILES Fragments
WLN and SMILES fragments correspond respectively to substrings of the Wiswesser Line Notation or Simplified Molecular Input Line Entry System strings used for encoding the chemical structures. Since simple string operations are much faster than processing of information in connection tables, the use of WLN descriptors was justified in the 1970s when computers were still very slow. At that time Adamson and Bawden published some linear QSAR models based on WLN fragments. They have also applied this kind of descriptor for hierarchical cluster analysis and automatic classification of chemical structures. Qu et al. have developed AES (Advanced Encoding System), a new WLN-based notation encoding chemical information for group contribution methods. Interest in line notation descriptors has not disappeared completely with the advent of powerful computers. Thus, SMILES fragment descriptors are used in the SMILOGP program to predict log P, whereas the recently developed LINGO system for assessing some biophysical properties and intermolecular similarities uses holographic representations of canonical SMILES strings.
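The idea of line-notation fragments as plain substrings can be sketched as follows. The function name and the substring length `q=4` are assumptions of this example, and the sketch is not a reproduction of SMILOGP or LINGO; it only shows that such descriptors need nothing beyond string slicing.

```python
from collections import Counter


def smiles_substrings(smiles, q=4):
    """Count all overlapping substrings of length q in a SMILES string."""
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


pentanol = smiles_substrings("CCCCCO")
butanol = smiles_substrings("CCCCO")
print(dict(pentanol))            # {'CCCC': 2, 'CCCO': 1}
print(dict(pentanol & butanol))  # {'CCCC': 1, 'CCCO': 1}
```

Because the slicing is character-based, two-letter element symbols such as Cl are split across fragments, and the result depends on which of the many valid SMILES strings is used for a molecule; working from a canonical SMILES removes the second ambiguity.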
1.3.1.3 Atom-centered Fragments
Atom-Centered Fragments (ACF) consist of a single central atom surrounded by one or several shells of atoms, each shell containing the atoms separated from the central one by the same topological distance. This type of structural fragment was introduced in the early 1950s by Tatevskii, and then by Benson, to predict some physicochemical properties of organic compounds in the framework of additive schemes.
ACF fragments containing only one shell of atoms around the central one (i.e., atom-centered neighborhoods of radius 1) were introduced into chemoinformatics practice in 1971 under the names "atom-centered fragments" and "augmented atoms" by Adamson, who studied their distribution in large chemical databases with the intention of using them as screens in chemical database searching. Hodes used, in SAR studies, both "augmented atoms" and "ganglia augmented atoms" representing ACF fragments with radius 2 and generalized second-shell atoms. Subsequently, ACF fragments with radius 1 were implemented in NASAWIN, TRAIL and ISIDA programs. ACF fragments with arbitrary radius were implemented by Filimonov, Poroikov and co-authors in the PASS program under the name Multilevel Neighborhoods of Atoms (MNA), by Xing and Glen as "tree structured fingerprints", by Bender and Glen as "atom environments" and "circular fingerprints" (Figure 1.1), and by Faulon as "molecular signatures".
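The shell structure of an atom-centered fragment can be obtained with a breadth-first search from the central atom. The sketch below is a simplified illustration under stated assumptions: atoms are labelled by element only, bond orders and hydrogens are ignored, and the input format is hypothetical. It does not reproduce the encoding of MNA descriptors, circular fingerprints or molecular signatures.

```python
from collections import deque


def atom_centered_fragment(elements, bonds, centre, radius=1):
    """Return the shells of atoms around `centre` up to `radius` bonds away.

    Shell k of the result holds the sorted element symbols of all atoms at
    topological distance k from the central atom (shell 0 is the centre).
    """
    neighbours = {i: [] for i in range(len(elements))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    distance = {centre: 0}
    queue = deque([centre])
    while queue:  # breadth-first search yields shortest path lengths
        atom = queue.popleft()
        if distance[atom] == radius:
            continue  # do not expand beyond the outermost shell
        for nxt in neighbours[atom]:
            if nxt not in distance:
                distance[nxt] = distance[atom] + 1
                queue.append(nxt)

    shells = [[] for _ in range(radius + 1)]
    for atom, d in distance.items():
        shells[d].append(elements[atom])
    return [sorted(shell) for shell in shells]


# Acetic acid heavy atoms: C0 (methyl), C1 (carboxyl), O2, O3
elements, bonds = list("CCOO"), [(0, 1), (1, 2), (1, 3)]
print(atom_centered_fragment(elements, bonds, centre=1, radius=1))
# [['C'], ['C', 'O', 'O']]
print(atom_centered_fragment(elements, bonds, centre=0, radius=2))
# [['C'], ['C'], ['O', 'O']]
```

With `radius=1` this corresponds to the "augmented atom" picture; larger radii give the multi-shell neighbourhoods described above.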
Several types of ACF fragments were designed to store local spectral parameters (chemical shifts) in spectroscopy databases. Thus, Bremser has developed Hierarchically Ordered Spherical Environment (HOSE), a system of substructure codes aimed at characterizing the spherical environment of single atoms and complete ring systems. The codes are generated automatically from 2D graphs and describe structural entities corresponding to chemical shifts. A very similar idea has also been implemented by Dubois et al. in the DARC system based on FREL (Fragment Réduit à un Environnement Limité) fragments. Xiao et al. have applied Atom-Centered Multilayer Code (ACMC) fragments for structural and substructural searching in large databases of compounds and reactions. An important recent application of ACF fragments concerns target prediction ("target fishing") in chemogenomic data analysis.
Excerpted from Chemoinformatics Approaches to Virtual Screening by Alexandre Varnek, Alex Tropsha. Copyright © 2008 Royal Society of Chemistry. Excerpted by permission of The Royal Society of Chemistry.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Table of Contents
Preface
1 - Fragment Descriptors in SAR/QSAR/QSPR studies, molecular similarity analysis and in virtual screening; Introduction; Historical survey; Main characteristics of Fragment Descriptors; Types of Fragments; Simple Fixed Types; WLN and SMILES Fragments; Atom-Centered Fragments; Bond-Centered Fragments; Maximum Common Substructures; Atom Pairs and Topological Multiplets; Substituents and Molecular Frameworks; Basic Subgraphs; Mined Subgraphs; Random Subgraphs; Library Subgraphs; Fragments describing supramolecular systems and chemical reactions; Storage of fragments' information; Fragment's Connectivity; Generic Graphs; Labeling Atoms; Application in Virtual Screening and In Silico Design; Filtering; Similarity Search; SAR Classification (Probabilistic) Models; QSAR/QSPR Regression Models; In Silico Design; Limitations of Fragment Descriptors; Conclusion
2 - Topological Pharmacophores; Introduction; 3D pharmacophore models and descriptors; Topological pharmacophores; Topological pharmacophores from 2D-alignments; Topological pharmacophores from 2D pharmacophore fingerprints; Topological index-based 'pharmacophores'?; Topological pharmacophores from 2D-alignments; Topological pharmacophores from pharmacophore fingerprints; Topological pharmacophore pair fingerprints; Topological pharmacophore triplets; Similarity searching with pharmacophore fingerprints - Technical Issues; Similarity searching with pharmacophore fingerprints - Some Examples; Machine-learning of Topological Pharmacophores from Fingerprints; Topological index-based 'pharmacophores'?; Conclusions
3 - Pharmacophore-based Virtual Screening in Drug Discovery; Introduction; Virtual Screening Methods; Chemical Feature-based Pharmacophores; The Term "3D Pharmacophore"; Feature Definitions and Pharmacophore Representation; Hydrogen bonding interactions; Lipophilic areas; Aromatic interactions; Charge-transfer interactions; Customization and definition of new features; Current super-positioning techniques for aligning 3D pharmacophores and molecules; Generation and Use of Pharmacophore Models; Ligand-based Pharmacophore Modeling; Structure-based Pharmacophore Modeling; Inclusion of Shape Information; Qualitative vs. Quantitative Pharmacophore Models; Validation of Models for Virtual Screening; Application of Pharmacophore Models in Virtual Screening; Pharmacophore Models as Part of a Multi-Step Screening Approach; Antitarget and ADME(T) Screening Using Pharmacophores; Pharmacophore Models for Activity Profiling and Parallel Virtual Screening; Pharmacophore Method Extensions and Comparisons to Other Virtual Screening Methods; Topological Fingerprints; Shape-based Virtual Screening; Docking Methods; Pharmacophore Constraints Used in Docking; Further Reading; Summary and Conclusion
4 - Molecular Similarity Analysis in Virtual Screening; Ligand-Based Virtual Screening; Foundations of Molecular Similarity Analysis; Molecular Similarity and Chemical Spaces; Similarity Measures; Activity Landscapes; Analyzing the Nature of Structure-Activity Relationships; Relationships between different SARs; SARs and target-ligand interactions; Qualitative SAR characterization; Quantitative SAR characterization; Implications for molecular similarity analysis and virtual screening; Strengths and Limitations of Similarity Methods; Conclusion and Future Perspectives
5 - Molecular Field Topology Analysis in drug design and virtual screening; Introduction: local molecular parameters in QSAR, drug design and virtual screening; Supergraph-based QSAR models; Rationale and history; Molecular Field Topology Analysis (MFTA); General principles; Local molecular descriptors: facets of ligand-biotarget interaction; Construction of molecular supergraph; Formation of descriptor matrix; Statistical analysis; Applicability control; From MFTA model to drug design and virtual screening; MFTA models in biotarget and drug action analysis; MFTA models in virtual screening; MFTA-based virtual screening of compound databases; MFTA-based virtual screening of generated structure libraries; Conclusion
6 - Probabilistic approaches in activity prediction; Introduction; Biological Activity; Dose-Effect Relationships; Experimental Data; Probabilistic Ligand-Based Virtual Screening Methods; Preparation of Training Sets; Creation of Evaluation Sets; Mathematical Approaches; Evaluation of Prediction Accuracy; Single-Targeted vs. Multi-Targeted Virtual Screening; PASS Approach; Biological Activities Predicted by PASS; Chemical Structure Description in PASS; SAR Base; Algorithm of Activity Spectrum Estimation; Interpretation of Prediction Results; Selection of the Most Prospective Compounds; Conclusions
7 - Fragment-based de novo design of druglike molecules; Introduction; From Molecules to Fragments; From Fragments to Molecules; Scoring the Design; Conclusions and Outlook
8 - Early ADME/T predictions: a toy or a tool?; Introduction; Which properties are important for early drug discovery?; Physico-chemical profiling; Lipophilicity; Solubility; Data availability and accuracy; Models; Why models don't work: the challenge of the Applicability Domain; AD based on similarity in the descriptor space; AD based on similarity in the property-based space; How reliable are predictions of physico-chemical properties?; Available Data for ADME/T biological properties; Absorption; Data; Models; Distribution; Data; Models; The usefulness of ADME/T models is limited by available data; Conclusions
9 - Compound Library Design - Principles and Applications; Introduction to Compound Library Design; Methods for Compound Library Design; Design for Specific Biological Activities; Similarity Guided Design of Targeted Libraries; Diversity Based Design of General Screening Libraries; Pharmacophore Guided Design of Focused Compound Libraries; QSAR Based Targeted Library Design; Protein Structure Based Methods for Compound Library Design; Design for Developability or Drug-likeness; Rule & Alert Based Approaches; QSAR Based ADMET Models; Undesirable Functionality Filters; Design for Multiple Objectives and Targets Simultaneously; Concluding Remarks
10 - Integrated Chemo- and Bioinformatics Approaches to Virtual Screening; Introduction; Availability of large compound collections for virtual screening; NIH Molecular Libraries Roadmap Initiative and the PubChem database; Other chemical databases in public domain; Structure based virtual screening; Major methodologies; Challenges and limitations of current approaches; The implementation of cheminformatics concepts in structure based virtual screening; Predictive QSAR models as virtual screening tools; Critical Importance of model validation; Applicability domains and QSAR model acceptability criteria; Predictive QSAR modeling workflow; Examples of application; Structure based chemical descriptors of protein ligand interface: the EnTESS method; Derivation of the EnTESS descriptors; Validation of the EnTESS descriptors for binding affinity prediction; Structure based cheminformatics approach to virtual screening: the CoLiBRI method; The representation of three-dimensional active sites in multidimensional chemistry space; The mapping between chemistry spaces of active sites and ligands; Summary and Conclusions