Chemoinformatics Approaches to Virtual Screening
by Alexandre Varnek (Editor), Weifan Zheng (Contribution by), Alex Tropsha (Editor), Stephen R Johnson (Contribution by), Igor Baskin (Contribution by)
Chemoinformatics is broadly a scientific discipline encompassing the design, creation, organization, management, retrieval, analysis, dissemination, visualization and use of chemical information. It is distinct from other computational molecular modeling approaches in that it uses unique representations of chemical structures in the form of multiple chemical descriptors; has its own metrics for defining similarity and diversity of chemical compound libraries; and applies a wide array of statistical, data mining and machine learning techniques to very large collections of chemical compounds in order to establish robust relationships between chemical structure and its physical or biological properties. Chemoinformatics addresses a broad range of problems in chemistry and biology; however, the most commonly known applications of chemoinformatics approaches have arguably been in drug discovery, where chemoinformatics tools have played a central role in the analysis and interpretation of structure-property data collected by means of modern high throughput screening. Early stages in modern drug discovery often involve screening small molecules for their effects on a selected protein target or a model of a biological pathway. In the past fifteen years, innovative technologies that enable rapid synthesis and high throughput screening of large libraries of compounds have been adopted in almost all major pharmaceutical and biotech companies. As a result, there has been a huge increase in the number of compounds available on a routine basis to quickly screen for novel drug candidates against new targets/pathways. In contrast, such technologies have rarely become available to the academic research community, thus limiting its ability to conduct large scale chemical genetics or chemical genomics research. However, the landscape of publicly available experimental data collection methods for chemoinformatics has changed dramatically in very recent years.
The term "virtual screening" is commonly associated with methodologies that rely on the explicit knowledge of three-dimensional structure of the target protein to identify potential bioactive compounds. Traditional docking protocols and scoring functions rely on explicitly defined three dimensional coordinates and standard definitions of atom types of both receptors and ligands. Albeit reasonably accurate in many cases, conventional structure based virtual screening approaches are relatively computationally inefficient, which has precluded them from screening really large compound collections. Significant progress has been achieved over many years of research in developing many structure based virtual screening approaches. This book is the first monograph that summarizes innovative applications of efficient chemoinformatics approaches towards the goal of screening large chemical libraries. The focus on virtual screening expands chemoinformatics beyond its traditional boundaries as a synthetic and data-analytical area of research towards its recognition as a predictive and decision support scientific discipline. 
The approaches discussed by the contributors to the monograph rely on chemoinformatics concepts such as:
- representation of molecules using multiple descriptors of chemical structures
- advanced chemical similarity calculations in multidimensional descriptor spaces
- the use of advanced machine learning and data mining approaches for building quantitative and predictive structure activity models
- the use of chemoinformatics methodologies for the analysis of drug-likeness and property prediction
- the emerging trend of combining chemoinformatics and bioinformatics concepts in structure based drug discovery
The chapters of the book are organized in a logical flow that a typical chemoinformatics project would follow - from structure representation and comparison to data analysis and model building to applications of structure-property relationship models for hit identification and chemical library design. It opens with an overview of modern methods of compound library design, followed by a chapter devoted to molecular similarity analysis. Four sections describe virtual screening based on the use of molecular fragments, 2D pharmacophores and 3D pharmacophores. Application of fuzzy pharmacophores for library design is the subject of the next chapter, followed by a chapter dealing with QSAR studies based on local molecular parameters. Probabilistic approaches based on 2D descriptors in the assessment of biological activities are also described, with an overview of the modern methods and software for ADME prediction. The book ends with a chapter describing the new approach of coding the receptor binding sites and their respective ligands in multidimensional chemical descriptor space, which affords an interesting and efficient alternative to traditional docking and screening techniques.
Ligand-based approaches, which are the focus of this work, are more computationally efficient than structure-based virtual screening, and there are very few books related to modern developments in this field. The focus on extending the experience accumulated in traditional areas of chemoinformatics research, such as Quantitative Structure Activity Relationships (QSAR) or chemical similarity searching, towards virtual screening makes this monograph essential reading for researchers in the area of computer-aided drug discovery. However, due to its generic data-analytical focus there will be a growing application of chemoinformatics approaches in multiple areas of chemical and biological research such as synthesis planning, nanotechnology, proteomics, physical and analytical chemistry and chemical genomics.
- Royal Society of Chemistry, The
- Publication date:
- Product dimensions: 6.30(w) x 9.30(h) x 1.00(d)
Read an Excerpt
Chemoinformatics Approaches to Virtual Screening
By Alexandre Varnek, Alex Tropsha
The Royal Society of Chemistry
Copyright © 2008 Royal Society of Chemistry
All rights reserved.
Fragment Descriptors in SAR/QSAR/QSPR Studies, Molecular Similarity Analysis and in Virtual Screening
IGOR BASKIN AND ALEXANDRE VARNEK
Chemoinformatics is an emerging science that concerns the mixing of chemical information resources to transform data into information, and information into knowledge. It is a branch of theoretical chemistry based on its own molecular model, with its own basic concepts, learning approaches and areas of application. Unlike quantum chemistry, which considers molecules as ensembles of electrons and nuclei, or force field molecular mechanics or dynamics simulations based on a classical molecular model ("atoms" and "bonds"), chemoinformatics represents molecules as objects in a chemical space defined by molecular descriptors. Among thousands of descriptors, fragment descriptors occupy a special place. Fragment descriptors represent selected subgraphs of a 2D molecular graph; structure-property approaches use their occurrences in molecules or binary values (0, 1) to indicate their presence or absence in the given graph.
The unique properties of fragment descriptors are related to the fact that (i) any molecular graph invariant (i.e., any molecular descriptor or property) can be uniquely represented as a linear combination of fragment descriptors; (ii) any symmetric similarity measure can be uniquely expressed in terms of fragment descriptors; and (iii) any regression or classification structure–property model can be represented as a linear equation involving fragment descriptors.
An important advantage of fragment descriptors is related to the simplicity of their calculation, storage and interpretation (see review articles). They belong to information-based descriptors, which tend to code the information stored in molecular structures. This contrasts with knowledge-based (or semi-empirical) descriptors derived from consideration of the mechanism of action. Owing to their versatility, fragment descriptors can efficiently be used to build structure–property models, perform similarity search, virtual screening and in silico design of chemical compounds with desired properties.
This chapter reviews fragment descriptors with respect to their use in structure–property studies, similarity search and virtual screening. After a short historical survey, different types of fragment descriptors are considered thoroughly. This is followed by a brief review of the application of fragment descriptors in virtual screening, focusing mostly on filtering, similarity search and direct activity/property assessment using quantitative structure–property models.
1.2 Historical Survey
Among a multitude of descriptors currently used in Structure-Activity Relationships/Quantitative Structure–Activity Relationships/Quantitative Structure-Property Relationships (SAR/QSAR/QSPR) studies, fragment descriptors occupy a special place. Their application as atom and bond increments in the framework of additive schemes can be traced back to the 1930–1950s; Vogel, Zahn, Souder, Franklin, Tatevskii, Bernstein, Laidler, Benson and Buss, and Allen pioneered this field. Smolenskii was one of the first, in 1964, to apply graph theory to the problem of predicting the physicochemical properties of organic compounds. Later on, these first additive-scheme approaches gradually evolved into group contribution methods. The latter are closely linked with thermodynamic approaches and, therefore, they are applicable only to a limited number of properties.
The epoch of QSAR (Quantitative Structure-Activity Relationships) studies began in 1963–1964 with two seminal approaches: the σ-ρ-π analysis of Hansch and Fujita and the Free-Wilson method. The former approach involves three types of descriptors related to electronic, steric and hydrophobic characteristics of substituents, whereas the latter considers the substituents themselves as descriptors. Both approaches are confined to strictly congeneric series of compounds. The Free–Wilson method additionally requires all types of substituents to be sufficiently present in the training set. A combination of these two approaches has led to QSAR models involving indicator variables, which indicate the presence of some structural fragments in molecules.
The non-quantitative SAR (Structure–Activity Relationships) models developed in the 1970s by Hiller, Golender and Rosenblit, Piruzyan, Avidon et al., Cramer, Brugger, Stuper and Jurs, and Hodes et al. were inspired by the then-popular paradigms of artificial intelligence, expert systems, machine learning and pattern recognition. In those approaches, chemical structures were described by means of indicators of the presence of structural fragments interpreted as topological (or 2D) pharmacophores (biophores, toxophores, etc.) or topological pharmacophobes (biophobes, toxophobes, etc.). Chemical compounds were then classified as active or inactive with respect to certain types of biological activity.
Methodologies based on fragment descriptors in QSAR/QSPR studies are not strictly confined to particular types of properties or compounds. In the 1970s Adamson and coworkers were the first to apply fragment descriptors in multiple linear regression analysis to find correlations with some biological activities, physicochemical properties, and reactivity.
An important class of fragment descriptors, the so-called screens (or structural keys, fingerprints), was also developed in the 1970s. As a rule, they are bit strings that can be stored and processed effectively by computers. Although their primary role is to provide efficient substructure searching in large chemical structure databases, they can also be used efficiently for similarity searching, clustering large chemical databases, assessing their diversity, and SAR and QSAR modeling.
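The role of screens in substructure searching can be illustrated with a minimal sketch. It assumes a tiny, hypothetical screen dictionary (`SCREEN_BITS`) and hand-assigned fragments for three compounds; real systems use far larger fragment dictionaries and generate the bits automatically. The idea shown is only the bitwise pre-filter: a database structure can contain the query substructure only if every bit set for the query is also set for the structure.

```python
from typing import Dict, Iterable

# Hypothetical screen dictionary: fragment name -> bit position.
SCREEN_BITS: Dict[str, int] = {
    "C=O": 0,
    "O-H": 1,
    "C-N": 2,
    "aromatic ring": 3,
}


def make_screen(fragments: Iterable[str]) -> int:
    """Pack the fragments present in a structure into an integer bit string."""
    bits = 0
    for fragment in fragments:
        bits |= 1 << SCREEN_BITS[fragment]
    return bits


def may_contain(molecule_screen: int, query_screen: int) -> bool:
    """Return True if every bit set in the query is also set in the molecule.

    False rules the molecule out. True only means the molecule passes the
    screen and would still need an atom-by-atom substructure match.
    """
    return (molecule_screen & query_screen) == query_screen


# Hand-assigned fragments, for illustration only.
database = {
    "acetic acid": make_screen(["C=O", "O-H"]),
    "aniline": make_screen(["C-N", "aromatic ring"]),
    "paracetamol": make_screen(["C=O", "O-H", "C-N", "aromatic ring"]),
}
query = make_screen(["C=O", "C-N"])
candidates = [name for name, screen in database.items() if may_contain(screen, query)]
```

Here only paracetamol survives the screen, so the expensive graph matching step would be run on one structure instead of three.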
Another important contribution was made in 1980 by Cramer who invented BC(DEF) parameters obtained by means of factor analysis of the physical properties of 114 organic liquids. These parameters correlate strongly with various physical properties of diverse liquid organic compounds. On the other hand, they could be estimated by linear additive-constitutive models involving fragment descriptors. Thus, a set of QSPR models encompassing numerous physical properties of diverse organic compounds has been developed using only fragment descriptors.
One of the most important developments of the 1980s was the CASE (Computer-Automated Structure Evaluation) program by Klopman et al. This "self-learning artificial intelligent system" can recognize activating and deactivating fragments (biophores and biophobes) with respect to the given biological activity and use this information to determine the probability that a test chemical is active. This methodology has been successfully applied to predict various types of biological activity: mutagenicity, carcinogenicity, hallucinogenic activity, anticonvulsant activity, inhibitory activity with respect to sparteine monooxygenase, β-adrenergic activity, μ-receptor binding (opiate) activity, antibacterial activity, antileukemic activity, etc. Using the multivariate regression technique, CASE can also build quantitative models involving fragment descriptors.
Starting in the early 1990s, various approaches and related software tools based on fragment descriptors have been developed and are listed in several conceptual and mini-review papers. Because of the wide scope and large variety of different approaches and applications in this field, many important ideas were reinvented many times and continue to be reinvented. In this review we try to present a clear state-of-the-art picture in this area.
1.3 Main Characteristics of Fragment Descriptors
In this section different types of fragments are classified with respect to their topology and the level of abstraction of molecular graphs.
1.3.1 Types of Fragments
A tremendous number of various fragments are used in structure-property studies: atoms, bonds, "topological torsions", chains, cycles, atom- and bond-centered fragments, maximum common substructures, line notation (WLN and SMILES) fragments, atom pairs and topological multiplets, substituents and molecular frameworks, basic subgraphs, etc. Their detailed description is given below.
Depending on the application area, two types of values taken by fragment descriptors are considered: binary and integer. Binary values indicate the presence (true, yes, 1) or the absence (false, no, 0) of a given fragment in a structure. They are usually used as screens and elements of fingerprints for chemical database management and virtual screening using similarity-based approaches as well as in SAR studies. Integer values corresponding to the occurrences of fragments in structures are used in QSAR/QSPR modeling.
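The two value types can be made concrete with a short sketch. The fragment vocabulary (`VOCABULARY`) and the hand-listed heavy-atom bonds of ethyl acetate are illustrative assumptions, not taken from the chapter; in practice the fragments would be enumerated from the molecular graph by software.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def descriptor_vectors(
    fragments_found: Sequence[str], vocabulary: Sequence[str]
) -> Tuple[List[int], List[int]]:
    """Return (integer, binary) descriptor vectors over a fixed vocabulary."""
    counts = Counter(fragments_found)
    # Integer values: number of occurrences of each fragment (QSAR/QSPR use).
    integer_vector = [counts[fragment] for fragment in vocabulary]
    # Binary values: presence (1) or absence (0) of each fragment (screens).
    binary_vector = [1 if count > 0 else 0 for count in integer_vector]
    return integer_vector, binary_vector


# Hypothetical vocabulary of bond-type fragments.
VOCABULARY = ["C-C", "C-O", "C=O", "C-N"]

# Bonds between heavy atoms in ethyl acetate, CH3-C(=O)-O-CH2-CH3.
ethyl_acetate = ["C-C", "C=O", "C-O", "C-O", "C-C"]

occurrences, presence = descriptor_vectors(ethyl_acetate, VOCABULARY)
# occurrences == [2, 2, 1, 0] and presence == [1, 1, 1, 0]
```

The binary vector discards how often a fragment occurs, which is why it suits screening and similarity searching, while the occurrence counts retain the information needed by additive and regression models.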
1.3.1.1 Simple Fixed Types
Disconnected atoms represent the simplest type of fragments. They are used to assess a chemical or biological property P in the framework of an additive scheme based on atomic contributions:
$$P = \sum_{i} n_i A_i \qquad (1.1)$$
where $n_i$ is the number of atoms of type $i$ and $A_i$ is the corresponding atomic contribution. Usually, the atom types account not only for the type of chemical element but also for hybridization, the number of attached hydrogen atoms (for heavy elements), occurrence in some groups or aromatic systems, etc. Nowadays, atom-based methods are used to predict some physicochemical properties and biological activities. Thus, several works have been devoted to assessing the octanol–water partition coefficient log P: the ALOGP method by Ghose and Crippen, later modified by Ghose and co-workers, and by Wildman and Crippen, the CHEMICALC-2 method by Suzuki and Kudo, the SMILOGP program by Convard and co-authors, and the XLOGP method by Wang and co-authors. Hou and co-authors used Equation (1.1) to calculate aqueous solubility. The ability of this approach to assess biological activities was demonstrated by Winkler et al.
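Equation (1.1) amounts to a weighted sum over atom types, as the following minimal sketch shows. The atom-type names and the contribution values in `CONTRIBUTIONS` are made up for illustration; they are not parameters of ALOGP, XLOGP or any other published method.

```python
from typing import Mapping


def additive_property(
    atom_counts: Mapping[str, int], contributions: Mapping[str, float]
) -> float:
    """Evaluate P = sum_i n_i * A_i over the atom types present in a molecule."""
    missing = set(atom_counts) - set(contributions)
    if missing:
        # An additive scheme cannot be applied to atom types it was not fitted on.
        raise KeyError(f"no contribution for atom types: {sorted(missing)}")
    return sum(n * contributions[atom_type] for atom_type, n in atom_counts.items())


# Hypothetical atomic contributions A_i (illustrative numbers only).
CONTRIBUTIONS = {"C.sp3": 0.5, "O.hydroxyl": -1.0, "H": 0.1}

# Atom-type counts n_i for ethanol, CH3-CH2-OH.
ethanol = {"C.sp3": 2, "O.hydroxyl": 1, "H": 6}

value = additive_property(ethanol, CONTRIBUTIONS)  # 2*0.5 + 1*(-1.0) + 6*0.1
```

In a real scheme the contributions would be obtained by fitting the linear model to experimental data for a training set of compounds.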
Chemical bonds are another type of simple fragment. The first bond-based additive schemes, such as those of Zahn, Bernstein and Allen, appeared almost simultaneously with the atom-based ones and dealt, presumably, with predictions of some thermodynamic properties.
"Topological torsions", invented by Nilakantan et al., are defined as linear sequences of four consecutively bonded non-hydrogen atoms. Each atom is described by the type of the corresponding chemical element, the number of attached non-hydrogen atoms and the number of π-electron pairs. Molecular descriptors indicating the presence or absence of topological torsions in chemical structures have been used to perform qualitative predictions of biological activity in structure-activity (SAR) studies. Later on, Kearsley et al. recognized that characterizing atoms by element types can be too specific for similarity searching and, therefore, does not provide sufficient flexibility for large-scale virtual screening. To solve this problem, they suggested assigning atoms in Carhart's atom pairs and Nilakantan's topological torsions to one of seven classes: cations, anions, neutral hydrogen bond donors, neutral hydrogen bond acceptors, polar atoms, hydrophobic atoms and other.
The above-mentioned structural fragments – atoms, bonds and topological torsions – can be regarded as chains of different lengths. Smolenskii suggested using the occurrences of chains in an additive scheme to predict the formation enthalpy of alkanes. For the last four decades, chain fragments have proved to be one of the most popular and useful types of fragment descriptors in QSPR/QSAR/SAR studies. Fragment descriptors based on enumerating chains in molecular graphs are efficiently used in many popular structure-property and structure-activity programs: CASE and MULTICASE (MultiCASE, MCASE) by Klopman, NASAWIN by Baskin et al., BIBIGON by Kumskov, TRAIL and ISIDA by Solov'ev and Varnek. "Molecular pathways" by Gakh and co-authors, and "molecular walks" by Rücker, represent chains of atoms.
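The view of atoms, bonds and topological torsions as chains of one, two and four atoms can be sketched as a path enumeration over a hydrogen-suppressed molecular graph. This is a simplified illustration, not the algorithm of any of the programs named above: atoms are labelled by element symbol only, bond orders are ignored, and the function name `chain_counts` is an assumption.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def chain_counts(
    atoms: Sequence[str], bonds: Sequence[Tuple[int, int]], max_atoms: int
) -> Counter:
    """Count linear chains (simple paths) containing 1..max_atoms atoms."""
    neighbours: Dict[int, List[int]] = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    counts: Counter = Counter()

    def extend(path: List[int]) -> None:
        # Every chain is reached from both ends; keep only the traversal whose
        # first atom index does not exceed the last, so each is counted once.
        if path[0] <= path[-1]:
            labels = [atoms[i] for i in path]
            # Use the lexicographically smaller reading direction as the name.
            counts["-".join(min(labels, labels[::-1]))] += 1
        if len(path) == max_atoms:
            return
        for nxt in neighbours[path[-1]]:
            if nxt not in path:
                extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return counts


# 1-Propanol, CH3-CH2-CH2-OH, as a hydrogen-suppressed graph.
propanol = chain_counts(["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)], max_atoms=4)
```

For 1-propanol this gives four one-atom chains (three C, one O), three two-atom chains (the bonds), two three-atom chains and a single four-atom chain, C-C-C-O, which corresponds to the one topological torsion in the molecule under this simplified labelling.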
In contrast to chains, cyclic and polycyclic fragments are relatively rarely applied as descriptors in QSAR/QSPR studies. Nevertheless, cyclicity is accounted for implicitly by means of: (i) introducing special "cyclic" and "aromatic" types of atoms and bonds, (ii) "collapsing" whole cycles and even polycyclic systems into "pharmacophoric" pseudo-atoms and (iii) generating cyclic fragments as a part of large fragments [Maximum Common Substructure (MCS), molecular framework, substituents]. In addition, cyclic fragments are widely used as screens for chemical database processing.
1.3.1.2 WLN and SMILES Fragments
WLN and SMILES fragments correspond respectively to substrings of the Wiswesser Line Notation or Simplified Molecular Input Line Entry System strings used for encoding the chemical structures. Since simple string operations are much faster than processing of information in connection tables, the use of WLN descriptors was justified in the 1970s when computers were still very slow. At that time Adamson and Bawden published some linear QSAR models based on WLN fragments. They have also applied this kind of descriptor for hierarchical cluster analysis and automatic classification of chemical structures. Qu et al. have developed AES (Advanced Encoding System), a new WLN-based notation encoding chemical information for group contribution methods. Interest in line notation descriptors has not disappeared completely with the advent of powerful computers. Thus, SMILES fragment descriptors are used in the SMILOGP program to predict log P, whereas the recently developed LINGO system for assessing some biophysical properties and intermolecular similarities uses holographic representations of canonical SMILES strings.
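A line-notation fragment descriptor can be sketched with plain string operations. The code below splits a SMILES string into overlapping substrings of length q and compares two molecules with a generic multiset Tanimoto coefficient. It is an illustration in the spirit of substring-based methods, not the published LINGO or SMILOGP implementation; the choice q = 4, the similarity formula and the function names are assumptions, and the input strings are assumed to be canonical SMILES produced by the same toolkit.

```python
from collections import Counter


def smiles_substrings(smiles: str, q: int = 4) -> Counter:
    """Count all overlapping substrings of length q in a SMILES string."""
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


def tanimoto(a: Counter, b: Counter) -> float:
    """Multiset Tanimoto: shared substring counts over the union of counts."""
    shared = sum((a & b).values())  # element-wise minimum of the counts
    total = sum((a | b).values())   # element-wise maximum of the counts
    return shared / total if total else 0.0


butanol = smiles_substrings("CCCCO")     # substrings: CCCC, CCCO
butylamine = smiles_substrings("CCCCN")  # substrings: CCCC, CCCN
similarity = tanimoto(butanol, butylamine)  # one shared substring out of three
```

Because no connection table is parsed, the whole comparison reduces to string slicing and counting, which is the speed advantage the text attributes to line-notation descriptors.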
1.3.1.3 Atom-centered Fragments
Atom-Centered Fragments (ACF) consist of a single central atom surrounded by one or several shells of atoms separated from the central one by the same topological distance. This type of structural fragment was introduced in the early 1950s by Tatevskii, and then by Benson, to predict some physicochemical properties of organic compounds in the framework of additive schemes.
ACF fragments containing only one shell of atoms around the central one (i.e., atom-centered neighborhoods of radius 1) were introduced into chemoinformatics practice in 1971 under the names "atom-centered fragments" and "augmented atoms" by Adamson, who studied their distribution in large chemical databases with the intention of using them as screens in chemical database searching. In SAR studies, Hodes used both "augmented atoms" and "ganglia augmented atoms" representing ACF fragments with radius 2 and generalized second-shell atoms. Subsequently, ACF fragments with radius 1 were implemented in the NASAWIN, TRAIL and ISIDA programs. ACF fragments with arbitrary radius were implemented by Filimonov, Poroikov and co-authors in the PASS program under the name Multilevel Neighborhoods of Atoms (MNA), by Xing and Glen as "tree structured fingerprints", by Bender and Glen as "atom environments" and "circular fingerprints" (Figure 1.1), and by Faulon as "molecular signatures".
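The shell structure of an atom-centered fragment can be sketched with a breadth-first traversal of a hydrogen-suppressed molecular graph. The code below is a deliberately simplified illustration: it records only the element symbols found in each shell and ignores bond orders and the connectivity between shells, so it is not equivalent to MNA descriptors, circular fingerprints or molecular signatures. The function name and the string encoding are assumptions.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def atom_centered_fragment(
    atoms: Sequence[str], bonds: Sequence[Tuple[int, int]], centre: int, radius: int
) -> str:
    """Encode the central atom and the atoms at topological distance 1..radius."""
    neighbours: Dict[int, List[int]] = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    seen = {centre}
    shell = [centre]
    shells = [[atoms[centre]]]
    for _ in range(radius):
        # Atoms first reached from the current shell lie one bond further out.
        next_shell = []
        for atom in shell:
            for nb in neighbours[atom]:
                if nb not in seen:
                    seen.add(nb)
                    next_shell.append(nb)
        if not next_shell:
            break
        # Sort the labels so the code does not depend on atom numbering.
        shells.append(sorted(atoms[i] for i in next_shell))
        shell = next_shell
    return "|".join(",".join(labels) for labels in shells)


# Ethanol, CH3-CH2-OH, as a hydrogen-suppressed graph.
ETHANOL_ATOMS = ["C", "C", "O"]
ETHANOL_BONDS = [(0, 1), (1, 2)]

radius1 = Counter(
    atom_centered_fragment(ETHANOL_ATOMS, ETHANOL_BONDS, i, 1)
    for i in range(len(ETHANOL_ATOMS))
)
```

For ethanol with radius 1 this produces three distinct fragments, `C|C`, `C|C,O` and `O|C`; increasing the radius makes each fragment more specific, since more of the surrounding structure is encoded.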
Several types of ACF fragments were designed to store local spectral parameters (chemical shifts) in spectroscopy databases. Thus, Bremser has developed Hierarchically Ordered Spherical Environment (HOSE), a system of substructure codes aimed at characterizing the spherical environment of single atoms and complete ring systems. The codes are generated automatically from 2D graphs and describe structural entities corresponding to chemical shifts. A very similar idea has also been implemented by Dubois et al. in the DARC system based on FREL (Fragment Réduit à un Environnement Limité) fragments. Xiao et al. have applied Atom-Centered Multilayer Code (ACMC) fragments for structural and substructural searching in large databases of compounds and reactions. An important recent application of ACF fragments concerns target prediction ("target fishing") in chemogenomic data analysis.
Excerpted from Chemoinformatics Approaches to Virtual Screening by Alexandre Varnek, Alex Tropsha. Copyright © 2008 Royal Society of Chemistry. Excerpted by permission of The Royal Society of Chemistry.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.
Meet the Author
Alexandre Varnek is Professor in Theoretical Chemistry at the Louis Pasteur University (ULP), France, and Head of the Laboratory of Chemoinformatics, Director of Master Courses on Chemoinformatics at the Faculty of Chemistry, ULP. He has 30 years' experience in the fields of molecular modelling and chemoinformatics and more than 80 publications including a monograph. His current research projects include the development of new approaches and software tools for in silico design of new compounds. Alexander Tropsha is Head of the Laboratory for Molecular Modeling, School of Pharmacy at the University of North Carolina, Chapel Hill, USA, as well as Professor and Chair, Division of Medicinal Chemistry and Natural Products at the School of Pharmacy. His research interests include Computer-Aided Drug Design, Chemoinformatics, and Structural Bioinformatics. He has authored or co-authored over 110 peer-reviewed research papers and book chapters.