Department of Biostatistics
University of Washington
According to the National Institutes of Health, a genome-wide association study is defined as any study of genetic variation across the entire human genome that is designed to identify genetic associations with observable traits (such as blood pressure or weight), or the presence or absence of a disease or condition. Whole genome information, when combined with clinical and other phenotype data, offers the potential for increased understanding of basic biological processes affecting human health, improvement in the prediction of disease and patient care, and ultimately the realization of the promise of personalized medicine. In addition, rapid advances in understanding the patterns of human genetic variation and maturing high-throughput, cost-effective methods for genotyping are providing powerful research tools for identifying genetic variants that contribute to health and disease.
This burgeoning science merges the principles of statistics and genetics to make sense of the vast amounts of information available with the mapping of genomes. In order to make the most of the information available, statistical tools must be tailored and translated for the analytical issues that are unique to large-scale association studies. This book will give researchers who have advanced biological knowledge and are entering the field of genome-wide association studies the groundwork to apply statistical analysis tools appropriately and effectively. With the use of consistent examples throughout the work, chapters will provide readers with best practice for getting started (design), analyzing, and interpreting data according to their research interests. Frequently used tests will be highlighted, and a critical analysis of the advantages and disadvantages of each, complemented by case studies, will provide readers with the information they need to make the right choice for their research. Additional tools including links to analysis tools, tutorials, and references will be available electronically to ensure the latest information is available.
* Easy access to key information including advantages and disadvantages of tests for particular applications, identification of databases, languages and their capabilities, data management risks, frequently used tests
* Extensive list of references including links to tutorial websites
* Case studies and Tips and Tricks
P.C. Sham, S.S. Cherny Department of Psychiatry and the State Key Laboratory of Brain and Cognitive Sciences, LKS Faculty of Medicine, University of Hong Kong, Hong Kong
Genetic Modeling: Twin, Adoption and Family Studies
Disease Gene Mapping: Linkage Studies
Disease Gene Mapping: Association Studies
Most common human diseases, such as coronary heart disease, diabetes, cancers, bipolar affective disorders and schizophrenia, have complex etiology. While they tend to cluster in families, they do not exhibit the characteristic Mendelian segregation ratios of single-gene disorders. These diseases therefore cannot be solely caused by a single genetic mutation, but have a more complex genetic architecture. For any complex disease, many questions about its genetic architecture can be raised. How many genetic variants are involved in individual differences in the propensity to develop the disease (e.g., just a handful, tens, hundreds, or thousands)? Where are these sequence changes located on the 23 chromosomes that constitute the human genome? What is the nature of the sequence changes in these variants (e.g., single base pair changes, copy number changes, etc.)? What are the functional consequences of these changes (e.g., change of amino acid sequence and therefore protein structure, or changes in the level or regulation of gene expression)? What are the frequencies and effect sizes of these changes? How important are these changes relative to the environmental variation in explaining individual differences in disease susceptibility? And how do the genetic changes interact with each other and with environmental factors?
This chapter reviews the research approaches that have been used to address some of the above questions and summarizes the current state of knowledge and understanding of the genetic architecture of complex diseases that has come out of these studies. The general conclusion is that common diseases are highly heterogeneous, with a small proportion of cases having relatively simple etiology dominated by a single genetic mutation, while the vast majority of cases are caused by the combined effect of multiple genetic and environmental factors each contributing a minor influence.
The genetics approach to the study of complex diseases is complementary to other research paradigms such as the use of cell culture or animal models. The advantages of the genetics approach are that: (1) the finite size and regularity of the genome allows a systematic search for sequence-phenotype relationships, which may unveil novel associations that implicate previously unsuspected biological pathways, and (2) the demonstration of sequence-phenotype relationships offers strong direct evidence for the role of a gene or a pathway in human disease, minimizing the need to perform potentially hazardous experiments on humans. On the other hand, the genetics approach is limited in its ability to tease apart detailed molecular mechanisms involved in disease etiology. The genetics approach is therefore a valuable complement, rather than an alternative, to other biological approaches.
GENETIC MODELING: TWIN, ADOPTION AND FAMILY STUDIES
One immediate question regarding the genetic architecture of complex traits is the relative importance of genetic versus environmental factors in explaining the individual differences in disease susceptibility in the population. If genetic factors are relatively unimportant, then further genetic studies may be unwarranted, and research efforts should be directed at environmental factors. On the other hand, if the contribution of genetic factors is substantial, then further genetic studies may help to identify the specific genetic variants involved and elucidate the mechanisms by which these variants influence disease propensity. The proportion of the total variance in disease liability that is explained by genetic (as against environmental) factors is defined as the heritability of the disease. It is important to appreciate that heritability is dependent on the genetic and environmental variations present in a population, so that changes in the variability of genetic or environmental factors can both lead to changes in heritability.
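The dependence of heritability on the variation present in a population can be illustrated with a small numerical sketch. It assumes, for simplicity, that the genetic and environmental contributions to liability are independent, so that total variance is their sum; the function name and the variance values are hypothetical and chosen only for illustration.

```python
def heritability(var_genetic, var_environment):
    """Proportion of total liability variance explained by genetic factors.

    Assumes genetic and environmental effects are independent, so the total
    variance is simply the sum of the two components (an illustrative
    simplification).
    """
    total = var_genetic + var_environment
    if total <= 0:
        raise ValueError("total variance must be positive")
    return var_genetic / total


# Same genetic variance, two populations with different environmental variance.
h2_baseline = heritability(1.0, 1.0)
h2_more_env = heritability(1.0, 3.0)
print(h2_baseline, h2_more_env)
```

With the genetic variance held fixed, tripling the environmental variance halves the heritability in this example, even though nothing about the genetic variants themselves has changed.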
Heritability is typically estimated from twin and adoption studies (see for an overview of methods for estimation of genetic and environmental components of variance). The principle of twin studies is as follows. Identical or monozygotic (MZ) twins share 100% of their genomes, while fraternal or dizygotic (DZ) twins share on average only 50% of their genomes by common descent. On the other hand, for twin pairs who are reared together, their sharing of environmental exposures may be the same regardless of zygosity. Thus, the presence of a greater phenotypic similarity among MZ than DZ twins can be attributed to the greater genetic similarity of MZ than DZ twins. Indeed, if the phenotype is a continuous trait, then the phenotypic similarity within twin pairs can be measured by an intraclass correlation, and the heritability estimated by twice the difference in intraclass correlations in MZ and DZ twins.
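The estimate described above, twice the difference between the MZ and DZ intraclass correlations, can be written as a one-line calculation. The sketch below is minimal; the function name and the example correlations are hypothetical and are not taken from any study.

```python
def twin_heritability(r_mz, r_dz):
    """Heritability estimated as twice the difference between the MZ and DZ
    intraclass correlations, as described in the text."""
    return 2.0 * (r_mz - r_dz)


# Hypothetical intraclass correlations, for illustration only.
h2 = twin_heritability(r_mz=0.8, r_dz=0.5)
print(round(h2, 3))
```

The estimate relies on the assumption stated in the text, namely that twins reared together share their environment to the same extent regardless of zygosity.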
The use of twin studies for heritability estimation of disease phenotypes is more complicated because of the dichotomous nature of the phenotype. This is usually done via a liability-threshold model, where the underlying liability is normally distributed in the population, and those individuals with liability above a certain threshold value develop disease. The twin data is then used to estimate the twin correlations for the underlying liability (tetrachoric correlations), and then the heritability can be estimated by twice the difference in these correlations between MZ and DZ twins. As seen in Table 1.1, heritability estimates from twin studies on a number of complex diseases range from 40% to as high as 90%, which are typical of most complex traits.
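One ingredient of the liability-threshold model can be sketched directly: if liability follows a standard normal distribution, the threshold is the point above which the upper tail equals the population prevalence. The sketch below covers only this step; estimating tetrachoric correlations from twin data requires fitting a bivariate normal model and is not shown. The function name and the prevalence value are hypothetical.

```python
from statistics import NormalDist


def liability_threshold(prevalence):
    """Threshold on a standard normal liability scale such that the
    proportion of the population above it equals the prevalence."""
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must be strictly between 0 and 1")
    return NormalDist().inv_cdf(1.0 - prevalence)


# Hypothetical disease affecting 1% of the population.
threshold = liability_threshold(0.01)
print(round(threshold, 3))
```

A rarer disease corresponds to a higher threshold, so affected individuals lie further into the upper tail of the liability distribution.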
Heritability can also be estimated by adoption studies, including MZ twins reared apart, whose correlation gives a direct estimate of heritability. In general, the correlations between biological relatives who have been separated by adoption provide estimates of heritability, whereas the correlations between adoptive relatives who are reared together but are biologically unrelated provide estimates of the influence of the family environment (see Plomin & Loehlin for a discussion of direct estimates of heritability). Arguably the most prominent among such studies, the Minnesota Study of Twins Reared Apart, confirms that practically all complex traits have a substantial genetic component (e.g., [13-15]).
The modeling of twin, adoption and family data can be used to address other important questions about the genetic architecture of complex disorders. For example, two different diseases can be modeled simultaneously, to detect shared genetic influences on the two diseases. In twin studies, shared genetic influences would be indicated by the presence of a cross-trait cross-twin correlation (i.e., correlation between disease 1 in twin 1 and disease 2 in twin 2) in both MZ and DZ twins that is greater in MZ than in DZ twins. Such studies have indicated substantial genetic sharing for some complex diseases; for example, schizophrenia with manic symptoms, and bipolar disorder with unipolar depression. Differences in the genetic influences on disease liability between males and females, for different ages, or under different environmental conditions, can also be modeled in twin data. For example, Kendler et al. found that twin similarity for social phobia was due primarily to genetic influences in males but was a result of shared environmental influences in females.
Another type of genetic modeling is aimed not at estimating the relative importance of genetic against environmental factors, but at whether the genetic component is made up of a single genetic factor of major effect (the single major locus [SML] model), or a large number of genetic factors each of small effect (the polygenic model). These are two extreme scenarios, and other possible models include the presence of a major locus on a polygenic background (the mixed model), or a few loci of major effect (the oligogenic model).
One approach to discriminating between different genetic models is to consider the drop-off in recurrence risk of disease as a function of genetic relationship to an affected index case (proband). A polygenic model is predicted to have a steeper drop-off than an SML model; the empirical recurrence risks for schizophrenia appear to be more consistent with a polygenic than an SML model. An alternative, more sophisticated method for discriminating between different genetic models is complex segregation analysis, which uses maximum likelihood on family data to fit model parameters and test different models. Thus, the presence of an SML can be tested by a likelihood ratio test of a mixed model with both SML and polygenic components, against a polygenic model. An alternative test for the presence of an SML considers a generalized transmission model in which genetic transmissions are allowed to deviate from Mendelian proportions, against a model in which genetic transmissions are constrained to Mendelian proportions. Complex segregation analysis has been applied to complex diseases with largely inconclusive results. This is because complex segregation analysis suffers both from low statistical power, which throws doubt on negative results, and from numerous possible artifacts, which throw doubt on positive results. For example, a complex segregation analysis showed strong but likely erroneous evidence for an SML effect for medical school attendance. The problem is that the model did not incorporate sibling environment and therefore could account for a higher sibling concordance than parent-offspring concordance for medical school attendance only through a recessive SML.
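The likelihood ratio test mentioned above compares the maximized log-likelihoods of two nested models. The sketch below shows only the final arithmetic and assumes a chi-square reference distribution with one degree of freedom; that is a simplifying assumption, since the appropriate reference distribution depends on how many parameters distinguish the two models and on whether any lie on a boundary. The function name and the log-likelihood values are hypothetical.

```python
import math


def likelihood_ratio_test(loglik_null, loglik_alt):
    """Return the likelihood ratio statistic and an approximate p-value.

    Assumes a chi-square reference distribution with 1 degree of freedom
    (an illustrative assumption, not a property of segregation analysis).
    """
    stat = max(2.0 * (loglik_alt - loglik_null), 0.0)
    # Upper tail of chi-square(1): P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical maximized log-likelihoods: polygenic (null) vs mixed model.
stat, p_value = likelihood_ratio_test(loglik_null=-105.0, loglik_alt=-100.0)
print(stat, p_value)
```

A small p-value here would favor the richer model, but, as the text notes, low power and modeling artifacts mean that such results need cautious interpretation.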
Notable oligogenic models for complex diseases were proposed by Risch. In these models, the effects of the different loci on disease risk can combine in an additive or multiplicative fashion. Risch derived, under each model, how the overall disease prevalence and recurrence risks can be related to the effects of the individual loci. Risch considered that the pattern of recurrence risks in schizophrenia is consistent with an oligogenic model with three to five loci. This would mean that these loci must have quite large effects, providing optimism for studies which aim to identify individual susceptibility loci.
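The way recurrence risks fall off with genetic relationship can be sketched numerically. The sketch rests on two results that are assumed here rather than quoted from the text: for a single locus, the excess recurrence risk ratio (the ratio minus one) halves with each further degree of relationship among unilineal relatives, and under a multiplicative model for unlinked loci the overall risk ratio is the product of the locus-specific ratios. The function names and all numerical values are hypothetical.

```python
def single_locus_risk_ratio(ratio_first_degree, degree):
    """Recurrence risk ratio for relatives of a given degree (1, 2, 3, ...),
    assuming the excess over 1 halves with each additional degree."""
    if degree < 1:
        raise ValueError("degree must be at least 1")
    return 1.0 + (ratio_first_degree - 1.0) / 2 ** (degree - 1)


def multiplicative_risk_ratio(locus_ratios_first_degree, degree):
    """Overall risk ratio when locus-specific ratios multiply across loci."""
    product = 1.0
    for ratio in locus_ratios_first_degree:
        product *= single_locus_risk_ratio(ratio, degree)
    return product


# Two scenarios with the same first-degree risk ratio of 9 (hypothetical).
one_locus = [single_locus_risk_ratio(9.0, d) for d in (1, 2, 3)]
two_loci = [multiplicative_risk_ratio([3.0, 3.0], d) for d in (1, 2, 3)]
print(one_locus)
print(two_loci)
```

Both scenarios start from the same first-degree risk ratio, but the two-locus multiplicative case falls off more quickly with more distant relationships, which is the kind of pattern used to discriminate between models.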
DISEASE GENE MAPPING: LINKAGE STUDIES
Linkage is based on the co-segregation of marker variants and disease in families. In humans, the frequency of crossovers in meiosis is such that each gametic genome has on average only around 35 crossover points. Co-segregation should therefore be detectable for marker loci quite far away from the disease-causing variant. Because linkage operates over long genetic distances, a positional mapping approach based on linkage can cover the entire genome by using a relatively small number of highly polymorphic markers. Standard marker sets for whole-genome linkage scans, based on 200–800 microsatellite polymorphisms, which became available in the 1990s, enabled the successful mapping of hundreds of rare single-gene disorders.
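A rough calculation shows why a few hundred markers are enough for a whole-genome linkage scan. The sketch assumes that one expected crossover corresponds to one Morgan (100 centimorgans) of genetic map length, so that about 35 crossovers per gamete implies a map of roughly 3500 cM, and it assumes evenly spaced markers while ignoring chromosome boundaries. The names are hypothetical.

```python
CROSSOVERS_PER_GAMETE = 35          # figure quoted in the text
CM_PER_EXPECTED_CROSSOVER = 100     # 1 Morgan = 100 cM (assumed conversion)

map_length_cm = CROSSOVERS_PER_GAMETE * CM_PER_EXPECTED_CROSSOVER


def average_marker_spacing_cm(n_markers, total_cm=map_length_cm):
    """Average spacing if markers were spread evenly over the genetic map."""
    if n_markers <= 0:
        raise ValueError("n_markers must be positive")
    return total_cm / n_markers


spacing_sparse = average_marker_spacing_cm(200)
spacing_dense = average_marker_spacing_cm(800)
print(spacing_sparse, spacing_dense)
```

Under these assumptions, the 200 to 800 markers mentioned in the text correspond to an average spacing of a few centimorgans to a little under 20 cM.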
Classical linkage analysis is typically carried out using the lod-score method, which is based on a single major locus parameterized by the disease allele frequency and the penetrances (conditional probability of disease) of the three disease locus genotypes. For Mendelian diseases, the values of these parameters can be easily specified from the results of population prevalence studies and segregation analyses. For complex diseases, the SML model is likely to be simplistic, and appropriate values of the model parameters are unknown and indeed may be different for different loci. Nevertheless, classical linkage analysis has been optimistically applied to complex diseases, particularly on pedigrees with an unusually large number of affected individuals. There have been some successes of this approach; for example, the identification of loci responsible for early-onset familial breast cancer, maturity-onset diabetes of the young and early-onset Alzheimer's disease. However, the patients in these successful linkage studies represent only a very small proportion (<5%) of the overall incidence of each of these complex diseases. When linkage analysis is successful, the families involved are usually large and have a pattern of inheritance that is very close to autosomal dominant with high penetrance. For collections of smaller families with less clear-cut Mendelian inheritance, the results of classical linkage analysis are much more often unconvincing and difficult to replicate. Examples of non-replicated linkage findings include schizophrenia and bipolar affective disorder.
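The idea behind a lod score can be sketched for the simplest possible case. The sketch assumes phase-known, fully informative meioses in which each gamete can be scored as recombinant or non-recombinant; it omits the disease allele frequency and penetrance parameters that the full method requires. The lod score is the base-10 logarithm of the likelihood at a recombination fraction theta relative to the likelihood under free recombination (theta = 0.5). The function name and counts are hypothetical.

```python
import math


def lod_score(recombinants, meioses, theta):
    """Lod score for phase-known, fully informative meioses.

    Simplified setting only: real analyses also model penetrance and
    disease allele frequency, which are omitted here.
    """
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must be between 0 and meioses")
    non_recombinants = meioses - recombinants
    log_lik_theta = (recombinants * math.log10(theta)
                     + non_recombinants * math.log10(1.0 - theta))
    log_lik_null = meioses * math.log10(0.5)
    return log_lik_theta - log_lik_null


# Hypothetical family data: 1 recombinant observed in 10 meioses.
lod = lod_score(recombinants=1, meioses=10, theta=0.1)
print(round(lod, 2))
```

Because lod scores are logarithms of likelihood ratios, scores from independent families can be added at the same value of theta, which is why large or numerous pedigrees are needed to accumulate convincing evidence.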
The lack of success in classical linkage analysis for the majority of cases of complex disease suggests that the genetic variants involved in such disorders typically have a small effect on disease risk. Thus multiple genetic variants are often involved in a single family, with the result that the co-segregation between disease and any single variant would be imperfect and different families will show linkage at different loci. It follows that for complex disorders, very large family samples are required to demonstrate conclusive linkage between disease and genetic markers.
A different version of linkage analysis, called non-parametric linkage, is based on excess local allele sharing between the affected relatives, above the level expected for the degree of relationship. For example, sibling pairs are expected to share on average one of the two alleles at any locus, and a locus which shows a significant excess of allele sharing above this level for affected sibling pairs would constitute evidence of linkage with the disease. This method has been considered to be more appropriate for complex diseases as it does not assume an SML. The non-parametric linkage approach, usually based on affected sib-pairs, became popular in the 1990s. The approach was used successfully, for example, on late-onset Alzheimer's disease, to identify a linkage region on chromosome 19, which was subsequently found to contain a major susceptibility variant, the APOE ε4 allele. Other studies using this approach (for example on multiple sclerosis, schizophrenia and autism) have been less successful. The problem is that, while non-parametric linkage does not require the assumption of a single major locus model, the statistical power to detect linkage is nevertheless highly sensitive to the effect size of the susceptibility variant. For realistic sample sizes, only genetic effects which account for a substantial fraction (e.g., 20%) of the disease heritability are likely to be detected. The lack of success of non-parametric linkage analysis for many complex diseases would exclude the presence of genes of major effect. However, even variants which account for as much as 10% of the disease heritability are likely to go undetected because of inadequate statistical power.
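The allele-sharing idea can be sketched as a simple test of the mean. The sketch assumes that the number of alleles shared identical by descent (0, 1 or 2) is known without ambiguity for each affected sibling pair, which is rarely true in practice. Under no linkage, the sharing probabilities are 1/4, 1/2 and 1/4, giving a mean of 1 and a variance of 0.5 per pair. The function name and the counts are hypothetical.

```python
import math


def mean_sharing_z(ibd_counts):
    """Z statistic for excess allele sharing in affected sibling pairs.

    Each entry is the number of alleles (0, 1 or 2) a pair shares identical
    by descent at the locus; assumes sharing is observed without error.
    """
    n_pairs = len(ibd_counts)
    if n_pairs == 0:
        raise ValueError("at least one sibling pair is required")
    mean_sharing = sum(ibd_counts) / n_pairs
    null_mean, null_variance = 1.0, 0.5
    return (mean_sharing - null_mean) / math.sqrt(null_variance / n_pairs)


# Hypothetical sample of 100 affected sibling pairs with modest excess sharing.
pairs = [0] * 20 + [1] * 50 + [2] * 30
z = mean_sharing_z(pairs)
print(round(z, 3))
```

In this hypothetical sample, a mean sharing of 1.1 alleles across 100 pairs gives a modest z statistic, which illustrates the point made above about power: small excesses in sharing require very large samples to be detected.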
Excerpted from ANALYSIS OF COMPLEX DISEASE ASSOCIATION STUDIES Copyright © 2011 by Elsevier Inc.. Excerpted by permission of Academic Press. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.