Practical Data Science with R, Second Edition

Summary

Practical Data Science with R, Second Edition takes a practice-oriented approach to explaining basic principles in the ever-expanding field of data science. You’ll jump right to real-world use cases as you apply the R programming language and statistical analysis techniques to carefully explained examples based in marketing, business intelligence, and decision support.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

About the technology

Evidence-based decisions are crucial to success. Applying the right data analysis techniques to your carefully curated business data helps you make accurate predictions, identify trends, and spot trouble in advance. The R data analysis platform provides the tools you need to tackle day-to-day data analysis and machine learning tasks efficiently and effectively.

About the book

Practical Data Science with R, Second Edition is a task-based tutorial that leads readers through dozens of useful data analysis practices using the R language. By concentrating on the most important tasks you’ll face on the job, this friendly guide is comfortable for both business analysts and data scientists. Because data is only useful if it can be understood, you’ll also find fantastic tips for organizing and presenting data in tables, as well as snappy visualizations.

What's inside

Statistical analysis for business pros
Effective data presentation
The most useful R tools
Interpreting complicated predictive models

About the reader

You’ll need to be comfortable with basic statistics and have an introductory knowledge of R or another high-level programming language.

About the author

Nina Zumel and John Mount founded a San Francisco–based data science consulting firm. Both hold PhDs from Carnegie Mellon University and blog on statistics, probability, and computer science.

Paperback (2nd Edition)

$49.99

Product Details

ISBN-13:	9781617295874
Publisher:	Manning
Publication date:	12/07/2019
Edition description:	2nd Edition
Pages:	483
Product dimensions:	7.40(w) x 9.10(h) x 1.20(d)

About the Author

Nina Zumel co-founded Win-Vector, a data science consulting firm in San Francisco. She holds a Ph.D. in robotics from Carnegie Mellon and was a content developer for EMC's Data Science and Big Data Analytics Training Course. Nina also contributes to the Win-Vector Blog, which covers topics in statistics, probability, computer science, mathematics, and optimization.

John Mount co-founded Win-Vector, a data science consulting firm in San Francisco. He has a Ph.D. in computer science from Carnegie Mellon and over 15 years of applied experience in biotech research, online advertising, price optimization and finance. He contributes to the Win-Vector Blog, which covers topics in statistics, probability, computer science, mathematics and optimization.

Table of Contents

Foreword xv

Preface xvi

Acknowledgments xvii

About this book xviii

About the authors xxv

About the foreword authors xxvi

About the cover illustration xxvii

Part 1 Introduction to data science 1

1 The data science process 3

1.1 The roles in a data science project 4

Project roles 4

1.2 Stages of a data science project 6

Defining the goal 7

Data collection and management 8

Modeling 10

Model evaluation and critique 12

Presentation and documentation 14

Model deployment and maintenance 15

1.3 Setting expectations 16

Determining lower bounds on model performance 16

2 Starting with R and data 18

2.1 Starting with R 19

Installing R, tools, and examples 20

R programming 20

2.2 Working with data from files 29

Working with well-structured data from files or URLs 29

Using R with less-structured data 34

2.3 Working with relational databases 37

A production-size example 38

3 Exploring data 51

3.1 Using summary statistics to spot problems 53

Typical problems revealed by data summaries 54

3.2 Spotting problems using graphics and visualization 58

Visually checking distributions for a single variable 60

Visually checking relationships between two variables 70

4 Managing data 88

4.1 Cleaning data 88

Domain-specific data cleaning 89

Treating missing values 91

The vtreat package for automatically treating missing variables 95

4.2 Data transformations 98

Normalization 99

Centering and scaling 101

Log transformations for skewed and wide distributions 104

4.3 Sampling for modeling and validation 107

Test and training splits 108

Creating a sample group column 109

Record grouping 110

Data provenance 111

5 Data engineering and data shaping 113

5.1 Data selection 116

Subsetting rows and columns 116

Removing records with incomplete data 121

Ordering rows 124

5.2 Basic data transforms 128

Adding new columns 128

Other simple operations 133

5.3 Aggregating transforms 134

Combining many rows into summary rows 134

5.4 Multitable data transforms 137

Combining two or more ordered data frames quickly 137

Principal methods to combine data from multiple tables 143

5.5 Reshaping transforms 149

Moving data from wide to tall form 149

Moving data from tall to wide form 153

Data coordinates 158

Part 2 Modeling methods 161

6 Choosing and evaluating models 163

6.1 Mapping problems to machine learning tasks 164

Classification problems 165

Scoring problems 166

Grouping: working without known targets 167

Problem-to-method mapping 169

6.2 Evaluating models 170

Overfitting 170

Measures of model performance 174

Evaluating classification models 175

Evaluating scoring models 185

Evaluating probability models 187

6.3 Local interpretable model-agnostic explanations (LIME) for explaining model predictions 195

LIME: Automated sanity checking 197

Walking through LIME: A small example 197

LIME for text classification 204

Training the text classifier 208

Explaining the classifier's predictions 209

7 Linear and logistic regression 215

7.1 Using linear regression 216

Understanding linear regression 217

Building a linear regression model 221

Making predictions 222

Finding relations and extracting advice 228

Reading the model summary and characterizing coefficient quality 230

Linear regression takeaways 237

7.2 Using logistic regression 237

Understanding logistic regression 237

Building a logistic regression model 242

Making predictions 243

Finding relations and extracting advice from logistic models 248

Reading the model summary and characterizing coefficients 249

Logistic regression takeaways 256

7.3 Regularization 257

An example of quasi-separation 257

The types of regularized regression 262

Regularized regression with glmnet 263

8 Advanced data preparation 274

8.1 The purpose of the vtreat package 275

8.2 KDD and KDD Cup 2009 277

Getting started with KDD Cup 2009 data 278

The bull-in-the-china-shop approach 280

8.3 Basic data preparation for classification 282

The variable score frame 284

Properly using the treatment plan 288

8.4 Advanced data preparation for classification 290

Using mkCrossFrameCExperiment() 290

Building a model 292

8.5 Preparing data for regression modeling 297

8.6 Mastering the vtreat package 299

The vtreat phases 299

Missing values 301

Indicator variables 303

Impact coding 304

The treatment plan 305

The cross-frame 306

9 Unsupervised methods 311

9.1 Cluster analysis 312

Distances 313

Preparing the data 316

Hierarchical clustering with hclust 319

The k-means algorithm 332

Assigning new points to clusters 338

Clustering takeaways 340

9.2 Association rules 340

Overview of association rules 340

The example problem 342

Mining association rules with the arules package 343

Association rule takeaways 351

10 Exploring advanced methods 353

10.1 Tree-based methods 355

A basic decision tree 356

Using bagging to improve prediction 359

Using random forests to further improve prediction 361

Gradient-boosted trees 368

Tree-based model takeaways 376

10.2 Using generalized additive models (GAMs) to learn non-monotone relationships 376

Understanding GAMs 376

A one-dimensional regression example 378

Extracting the non-linear relationships 382

Using GAM on actual data 384

Using GAM for logistic regression 387

GAM takeaways 388

10.3 Solving "inseparable" problems using support vector machines 389

Using an SVM to solve a problem 390

Understanding support vector machines 395

Understanding kernel functions 397

Support vector machine and kernel methods takeaways 399

Part 3 Working in the real world 401

11 Documentation and deployment 403

11.1 Predicting buzz 405

11.2 Using R markdown to produce milestone documentation 406

What is R markdown? 407

Knitr technical details 409

Using knitr to document the Buzz data and produce the model 411

11.3 Using comments and version control for running documentation 414

Writing effective comments 414

Using version control to record history 416

Using version control to explore your project 422

Using version control to share work 424

11.4 Deploying models 428

Deploying demonstrations using Shiny 430

Deploying models as HTTP services 431

Deploying models by export 433

What to take away 435

12 Producing effective presentations 437

12.1 Presenting your results to the project sponsor 439

Summarizing the project's goals 440

Stating the project's results 442

Filling in the details 444

Making recommendations and discussing future work 446

Project sponsor presentation takeaways 446

12.2 Presenting your model to end users 447

Summarizing the project goals 447

Showing how the model fits user workflow 448

Showing how to use the model 450

End user presentation takeaways 452

12.3 Presenting your work to other data scientists 452

Introducing the problem 452

Discussing related work 453

Discussing your approach 454

Discussing results and future work 455

Peer presentation takeaways 457

Appendix A Starting with R and other tools 459

Appendix B Important statistical concepts 484

Appendix C Bibliography 519

Index 523
