Introduction to Bioinformatics with R: A Practical Guide for Biologists / Edition 1

by Edward Curry
ISBN-10: 1138495719
ISBN-13: 9781138495715
Pub. Date: 11/03/2020
Publisher: CRC Press

Overview

In biological research, the amount of data available to researchers has increased so much in recent years that it is becoming increasingly difficult to understand the current state of the art without some experience and understanding of data analytics and bioinformatics. An Introduction to Bioinformatics with R: A Practical Guide for Biologists leads the reader through the basics of computational analysis of data encountered in modern biological research. No previous experience with statistics or programming is required: readers will develop the ability to plan suitable analyses of biological datasets and to use the R programming environment to perform them. This is achieved through a series of case studies that use R to answer research questions with molecular biology datasets. Broadly applicable statistical methods are explained, including linear and rank-based correlation, distance metrics and hierarchical clustering, hypothesis testing using linear regression, proportional hazards regression for survival data, and principal component analysis. These methods are then applied as appropriate throughout the case studies, illustrating how they can be used to answer research questions.

Key Features:

· Provides a practical course in computational data analysis suitable for students or researchers with no previous exposure to computer programming.

· Describes in detail, from basic principles, the theoretical basis for the statistical analysis techniques used throughout the textbook.

· Presents walk-throughs of data analysis tasks using R and example datasets. All R commands are presented and explained in order to enable the reader to carry out these tasks themselves.

· Uses outputs from a large range of molecular biology platforms including DNA methylation and genotyping microarrays; RNA-seq, genome sequencing, ChIP-seq and bisulphite sequencing; and high-throughput phenotypic screens.

· Gives worked-out examples geared towards problems encountered in cancer research, which can also be applied across many areas of molecular biology and medical research.

This book has been developed over years of training biological scientists and clinicians to analyse the large datasets available in their cancer research projects. It is appropriate for use as a textbook or as a practical book for biological scientists looking to gain bioinformatics skills.


Product Details

ISBN-13: 9781138495715
Publisher: CRC Press
Publication date: 11/03/2020
Series: Chapman & Hall/CRC Computational Biology Series
Pages: 310
Product dimensions: 6.12(w) x 9.19(h) x (d)

About the Author

Ed Curry initially studied computer science (Cambridge) and AI with a systems biology specialism (Edinburgh) before embarking on a PhD in computer-based molecular biology, studying stem cell differentiation at the Centre for Regenerative Medicine in Edinburgh. He spent 10 years in the Faculty of Medicine at Imperial College London, during which time he established a research group focusing on interactions between the genetic, epigenetic and transcriptional state of cancer cells during carcinogenesis and the acquisition of drug resistance. He has extensive teaching experience as a lecturer, examiner and course director, including co-founding Imperial College’s Cancer Informatics MRes program and the Genetics & Genomics module for the BSc in Medical Biosciences. He joined GSK R&D in October 2019, remaining an honorary lecturer at Imperial College.

Table of Contents

Acknowledgements xi

1 Introduction 1

1.1 Why informatics is important for biologists 1

1.2 How to use this book 2

2 Introduction to R 5

2.1 Obtaining R 5

2.1.1 Downloading R 5

2.1.2 Installing R 6

2.2 R console 6

2.2.1 Starting the R console 7

2.3 The R workspace 7

2.3.1 Creating/deleting objects 8

2.3.2 The working directory 8

2.4 Data handling 10

2.4.1 Basic data types 10

2.4.2 Vectors 11

2.4.3 Arrays 11

2.4.4 Lists 12

2.4.5 Data frames 14

2.4.6 Data input/output 15

2.5 More advanced concepts: Scripts and functions 16

2.5.1 Simple scripts 16

2.5.2 Functions 17

2.5.3 Using 'apply' 19

2.5.3.1 Apply 19

2.5.3.2 Sapply 20

2.5.3.3 Lapply 22

2.5.3.4 Mapply 23

2.6 Plots 24

2.6.1 Simple scatterplot 24

2.6.2 Arguments of plot() 25

2.6.3 Multiple plots on one graph 25

2.6.4 Scatterplots of multiple variables 25

2.6.5 Box plots 25

2.6.6 Saving images to file 27

2.7 More advanced graphics with ggplot2 27

2.8 Using R help 30

3 An Introduction to LINUX for Biological Research 31

3.1 UNIX 31

3.2 Linux survival guide 32

3.3 Useful dependencies and programs 37

4 Statistical Methods for Data Analysis 39

4.1 What are statistical methods, and why do we use them in biological research? 39

4.1.1 A worked example 40

4.1.2 A brief summary 43

4.2 What do I need to understand statistics? 43

4.2.1 Probability 43

4.2.1.1 Random variables 43

4.2.1.2 Probability distributions 45

4.2.1.3 Hypothesis testing 47

4.2.2 Linear algebra 52

4.2.3 Summary 53

4.3 Normalization: Removing technical variation 53

4.3.1 Centering and scaling 55

4.3.2 An illustrative example 58

4.3.3 Quantile normalization 59

4.3.4 Batch effects 59

4.4 Correlation 60

4.4.1 Pearson correlation coefficient 60

4.4.2 Spearman's rank correlation 61

4.4.3 Examples 61

4.5 Clustering 65

4.5.1 Clustering illustration using R 66

4.6 Linear regression models 69

4.6.1 Limma 72

4.6.1.1 Installing limma 73

4.6.1.2 Categorical explanatory variables 73

4.6.1.3 Continuous explanatory variables 76

4.7 Multiple hypothesis testing 78

4.8 Survival analysis 79

4.8.1 Kaplan-Meier plots 79

4.8.2 Cox proportional hazards regression models 81

4.9 Projection methods 81

4.9.1 PCA 82

4.9.2 PLS 85

4.10 Resampling: Permutation tests and the bootstrap 86

4.11 Stability and robustness 87

4.12 Summary 87

5 Analyzing Generic Tabular Numeric Datasets in R 89

5.1 Introduction 89

5.2 Loading data into R 89

5.3 Data visualisation 92

5.3.1 Scatter plots 92

5.3.2 Box plots 93

5.3.3 Bar charts 94

5.4 Correlation and clustering 94

5.4.1 Correlation 95

5.4.2 Clustering 98

5.4.3 Heatmaps 101

5.5 Statistical analysis using linear models 103

5.5.1 Comparison of two groups 104

5.5.2 Alternative models 106

5.6 Summary 107

6 Functional Enrichment Analysis 109

6.1 Introduction 109

6.2 Loading gene sets into R 109

6.3 Over-representation 112

6.3.1 Online tools 113

6.3.2 Testing gene sets in R 113

6.4 Systematic enrichment 117

6.4.1 Online tools 117

6.4.2 Testing gene sets in R 117

6.5 Summary 120

7 Integrating Multiple Datasets in R 121

7.1 Introduction 121

7.2 Data import 123

7.3 Exploratory data analysis 123

7.4 Integrating multiple datasets 131

7.4.1 Survival analysis 134

7.5 Multiple molecular endpoints 141

7.6 Summary 143

8 Analyzing Microarray Data in R 145

8.1 Bioconductor 146

8.2 Accessing microarray data from GEO 147

8.3 Single-channel array analysis 148

8.4 Loading data 148

8.5 Data visualisation 149

8.5.1 Image plots 150

8.5.2 MA plots 151

8.5.3 Scatterplots 151

8.5.4 Box plots 153

8.6 Normalizing data 155

8.7 Differential expression (linear models) 158

8.7.1 Design matrix 159

8.7.2 Fitting linear models 160

8.7.3 Making use of the results 161

8.7.4 Postscript: Assumptions 164

8.8 Clustering and correlation 164

8.8.1 Expression profiles 164

8.8.2 Correlation 165

8.9 Clustering 169

8.9.1 Filtering 171

8.10 Survival analysis 175

8.10.1 Kaplan-Meier plots 178

8.10.2 Cox proportional hazards regression 183

8.11 Footnote: Correlation to explore associated functions 187

9 Analyzing DNA Methylation Microarray Data in R 189

9.1 Introduction 189

9.2 Importing raw data 190

9.3 Quality control 191

9.4 Normalization and estimating methylation level 193

9.5 Analyzing beta values 194

9.6 Using previously preprocessed data 197

9.7 Further analyses using minfi 200

10 DNA Analysis with Microarrays 203

10.1 Introduction 203

10.2 Genotyping 203

10.2.1 Normalization 204

10.2.2 Genotype calling 205

10.2.3 Downstream analysis: Genome-wide association tests 208

10.3 Copy number analysis 210

10.3.1 Normalization 211

10.3.2 Copy number estimation 212

10.3.3 Segmentation 212

10.3.3.1 Hidden Markov model 213

10.3.3.2 Circular binary segmentation 216

10.3.4 Downstream analysis 217

10.3.4.1 Mapping CNA data to genes 217

10.3.4.2 Finding frequently-mutated genes 220

10.4 Summary 221

11 Working with Sequencing Data 223

11.1 Introduction 223

11.2 Sequence data analysis tasks 224

11.3 Quality control 224

11.3.1 Base call quality filtering 226

11.3.2 Adapter trimming 228

11.4 Alignment 230

11.4.1 Bowtie 231

11.4.2 BWA 232

11.4.3 Post-alignment filtering 233

11.4.4 Removing duplicate reads 233

11.5 Obtaining sequencing data from the SRA 235

12 Genomic Sequence Profiling 239

12.1 Introduction 239

12.2 SNV: Single nucleotide variants 239

12.3 Variant filtering and annotation 241

12.4 Indels: Short insertions and deletions 244

12.5 SV: Structural variants 245

12.6 Making use of variant calls 246

12.7 Summary 256

13 ChIP-seq 259

13.1 Introduction 259

13.2 Cross-correlation 259

13.3 Filtering blacklisted reads 263

13.4 Peak calling 263

13.5 Peak annotation 265

13.6 Quantitative comparisons of ChIP-seq libraries 267

13.7 Summary 270

14 RNA-seq 271

14.1 Introduction 271

14.2 Obtaining RNA-seq data from GEO 272

14.3 Transcript quantification via pseudoalignment 273

14.3.1 Building a transcript index 273

14.3.2 Quantifying transcripts using reads 274

14.3.3 Downstream analysis 275

14.4 Analysis with transcriptome assembly 278

14.4.1 Building the transcriptome directly 279

14.4.2 Transcript quantification 280

14.4.3 Downstream analysis 282

14.5 Summary 285

15 Bisulphite Sequencing 287

15.1 Introduction 287

15.2 Alignment and methylation calls 289

15.3 Downstream analysis 290

15.4 Summary 293

16 Final Notes 295

Index 297
