ISBN-10: 1119632722
ISBN-13: 9781119632726
Pub. Date: 10/27/2020
Publisher: Wiley
The Big R-Book: From Data Science to Learning Machines and Big Data / Edition 1

by Philippe J. S. De Brouwer

Hardcover

Overview

Introduces professionals and scientists to statistics and machine learning using the programming language R

Written by and for practitioners, this book provides an overall introduction to R, focusing on tools and methods commonly used in data science and emphasizing practice and business use. It covers a wide range of topics in a single volume, including big data, databases, statistical machine learning, data wrangling, data visualization, and the reporting of results. The topics covered are all important for someone with a science or math background who wants to quickly learn several practical technologies in order to enter or transition to the growing field of data science.

The Big R-Book: From Data Science to Learning Machines and Big Data comprises nine parts. It starts with an introduction to the subject, followed by an overview of R and elements of statistics. The third part revolves around data, while the fourth focuses on data wrangling. Part 5 teaches readers to explore data, Part 6 covers building models, Part 7 introduces the reader to the reality in companies, Part 8 covers reports and interactive applications, and Part 9 introduces big data and performance computing. The book also includes some helpful appendices.

  • Provides a practical guide for non-experts with a focus on business users
  • Contains a unique combination of topics including an introduction to R, machine learning, mathematical models, data wrangling, and reporting
  • Uses a practical tone and integrates multiple topics in a coherent framework
  • Demystifies the hype around machine learning and AI by enabling readers to understand the provided models and program them in R
  • Shows readers how to visualize results in static and interactive reports
  • Supplementary materials include PDF slides based on the book’s content, as well as all the extracted R code, and are available to everyone on a Wiley Book Companion Site

The Big R-Book is an excellent guide for science, technology, engineering, or mathematics students who wish to make a successful transition from the academic world to the professional one. It will also appeal to young data scientists, quantitative analysts, and analytics professionals, as well as those who build mathematical models.

Product Details

ISBN-13: 9781119632726
Publisher: Wiley
Publication date: 10/27/2020
Pages: 928
Product dimensions: 8.80(w) x 11.00(h) x 1.70(d)

About the Author

Philippe J.S. De Brouwer, PhD, is director at HSBC, guest professor at four universities (University of Warsaw, Jagiellonian University, Krakow School of Business and AGH University of Science and Technology) and honorary consul for Belgium in Krakow. As a professor, he builds bridges not only between universities and the industry, but also across disciplines. He teaches mathematicians leadership skills and non-mathematicians coding. As a scientist, he tries to combine research on financial markets, psychology, and investments to the benefit of the investor. As an honorary consul he is passionate about serving the community and helping initiatives grow.

Table of Contents

Foreword

About the Author

Acknowledgements

Preface / Why this book?

Contents

I Introduction

1 The Big Picture with Kondratiev and Kardashev

2 The Scientific Method and Data

3 Conventions

II Starting with R and Elements of Statistics

4 The Basics of R

4.1 Variables

4.2 Data Types

4.2.1 Elementary Data Types

4.2.2 Vectors

4.2.3 Lists

4.2.4 Matrices

4.2.5 Arrays

4.2.6 Factors

4.2.7 Data Frames

4.3 Operators

4.3.1 Arithmetic Operators

4.3.2 Relational Operators

4.3.3 Logical Operators

4.3.4 Assignment Operators

4.3.5 Other Operators

4.3.6 Loops

4.3.7 Functions

4.3.8 Packages

4.3.9 Strings

4.4 Selected Data Interfaces

4.4.1 CSV Files

4.4.2 Excel Files

4.4.3 Databases

4.5 Distributions

4.5.1 Normal Distribution

4.5.2 Binomial Distribution

5 Lexical Scoping and environments

5.1 Environments in R

5.2 Lexical Scoping in R

6 The Implementation of OO

6.1 Base Types

6.2 S3 Objects

6.2.1 Creating S3 objects

6.2.2 Creating generic methods

6.2.3 Method dispatch

6.2.4 Group generic functions

6.3 S4 Objects

6.3.1 Creating S4 Objects

6.3.2 Recognising objects, generic functions, and methods

6.3.3 Creating S4 Generics

6.3.4 Method dispatch

6.4 The reference class, refclass, RC or R5 model

6.4.1 Creating R5 objects

6.5 OO Conclusion

7 Tidy R with the Tidyverse

7.1 The Philosophy of the Tidyverse

7.2 Packages in the tidyverse

7.3 Working with the tidyverse

7.3.1 tibbles

7.3.2 Piping with R

7.3.3 Attention points when using the pipe command

7.3.3.1 Advanced piping

7.3.3.2 Conclusion

8 Elements of Descriptive Statistics

8.1 Measures of Central Tendency

8.1.1 Mean

8.1.2 The Median

8.1.3 The Mode

8.2 Measures of Variation or Spread

8.3 Measures of Covariation

8.4 Chi Square Tests

9 Further Reading

III Data Import

10 A short history of modern database systems

11 RDBMS

12 SQL

12.1 Designing the database

12.2 Building the database

12.3 Adding data to the database

12.4 Querying the database

12.5 Modifying an existing database

12.6 Advanced features of SQL

13 Connecting R to an SQL database

IV Data Wrangling

14 Anonymising Data

15 Data Wrangling in the tidyverse

15.1 Tidy data

15.2 Importing the data

15.2.1 Importing from an SQL RDBMS

15.2.2 Importing flat files in the tidyverse

15.2.2.1 CSV Files

15.2.2.2 Making sense of fixed width files

15.3 Tidying up data with tidyr

15.3.1 Splitting tables

15.3.2 headers to data

15.3.3 Spreading one column over many

15.3.4 separate

15.3.5 Unite

15.3.6 Wrong Data

15.4 Playing with tibbles: SQL-like functionality

15.4.1 Selecting

15.4.2 Filtering

15.4.3 Joining

15.4.4 Mutating

15.4.5 Set Operations

15.5 String Manipulation in the tidyverse

15.5.1 Basic string manipulation

15.5.2 Pattern matching with regular expressions

15.5.2.1 Regular Expressions

15.5.2.2 Functions using Regex

15.6 Dates with lubridate

15.6.0.1 ISO 8601 Format

15.6.0.2 Timezones

15.6.0.3 Extract and set date and time components

15.6.0.4 Calculating with date-times

15.7 Factors with forcats

16 Dealing with missing data

17 Data Binning

17.1 Tuning the binning procedure

17.2 More complex cases: matrix binning

17.3 Weight of evidence and information value

18 Factoring analysis and principal components

18.1 Principal components analysis

18.2 Factor Analysis

V Explore Data

19 Using Descriptive Statistics

20 Standard Charts & Graphs

20.1 Pie Charts

20.2 Bar Charts

20.3 Boxplots

20.4 Violin plots

20.5 Histograms

20.6 Scatterplots

20.7 Line Graphs

20.8 Plotting Functions

20.9 Maps and contour plots

21 Selected Visualization Methods

21.1 Heat-maps

21.2 Text Mining

21.2.1 Word Clouds

21.2.2 Word Associations

21.3 Colours in R

22 Time Series Analysis

22.1 Time Series in R

22.2 Forecasting

22.2.1 Moving Average

22.2.2 Seasonal Decomposition

VI Modelling

23 Regression Models

23.1 Linear Regression

23.2 Multiple Linear Regression

23.2.1 Poisson Regression

23.2.2 Non-Linear Regression

23.3 Performance of regression models

23.3.1 Mean Square Error (MSE)

23.3.2 R-Squared

23.3.3 Mean Average Deviation (MAD)

24 Classification Models

24.1 Logistic Regression

24.2 The performance of binary classification models

24.2.1 The Confusion Matrix and related measures

24.2.2 ROC

24.2.3 AUC

24.2.4 AUC Gini for logistic regression

24.2.5 Kolmogorov-Smirnov (KS) for logistic regression

24.2.6 Finding an Optimal Cut-off

25 Learning Machines

25.1 Decision Tree

25.1.1 Essential Background

25.1.2 Important considerations

25.1.3 Growing trees with R

25.1.4 Evaluating the performance of a decision tree

25.1.4.1 The performance of the regression tree

25.1.4.2 The performance of the classification tree

25.2 Random Forest

25.3 Artificial Neural Networks (ANN)

25.3.1 The basics of ANNs in R

25.3.2 An example of a work-flow to develop an ANN

25.4 Support Vector Machine

25.5 Unsupervised learning and clustering

25.5.1 k-means clustering

25.5.2 Fuzzy clustering

25.5.3 Hierarchical clustering

25.5.4 Other clustering methods

26 Towards a tidy modelling cycle with modelr

27 Model Validation

27.1 Model quality measures

27.2 Predictions and residuals

27.3 Bootstrapping

27.4 Cross-Validation

27.4.1 training and validating

27.5 Monte-Carlo Cross Validation

27.6 k-Fold Cross Validation

27.7 Comparison

27.8 Validation in a broader perspective

28 Labs

28.1 Financial Analysis with QuantMod

28.1.1 The quantmod data structure

28.1.2 Support functions supplied by quantmod

28.1.3 Financial modelling in quantmod

29 Multi Criteria Decision Analysis (MCDA)

29.1 What and Why

29.2 General Work-flow

29.3 Identify the issue at hand: step 1 and 2

29.4 STEP 3: the decision matrix

29.4.1 Construct a decision matrix

29.4.2 Normalize the decision matrix

29.5 STEP 4: leave out inefficient and unacceptable alternatives

29.5.1 Unacceptable Alternatives

29.5.2 Dominance: inefficient alternatives

29.6 Printing preference relationships

29.7 STEP 6: MCDA Methods

29.7.1 Examples of non-compensatory methods

29.7.2 The weighted sum method (WSM)

29.7.3 WPM

29.7.4 ELECTRE

29.7.4.1 ELECTRE I

29.7.4.2 ELECTRE II

29.7.5 PROMethEE

29.7.5.1 PROMethEE I

29.7.5.2 PROMethEE II

29.7.6 PCA (Gaia)

29.7.7 Outranking methods

29.7.8 Goal Programming

29.8 Summary MCDA

VII Introduction to Companies

30 Financial Accounting

30.1 The Statements of Accounts

30.1.1 Income Statement

30.1.2 Net Income: The P&L statement

30.1.3 Balance Sheet

30.2 The Value Chain

30.3 Further Terminology

30.4 Selected Financial Ratios

31 Management Accounting

31.1 Introduction

31.2 Selected Methods in MA

31.2.1 Cost Accounting

31.2.2 Selected Cost Types

31.3 Selected Use Cases of MA

31.3.1 Balanced Scorecard

31.3.2 Key Performance Indicators

31.3.2.1 Selection of KPIs

32 Asset Valuation Basics

32.1 Time Value of Money

32.2 Cash

32.3 Bonds

32.3.1 Valuation of Bonds

32.3.2 Duration

32.3.2.1 Macaulay Duration

32.3.2.2 Modified Duration

32.4 Equities

32.4.1 Valuation of Equities

32.4.1.1 CAPM

32.4.2 Absolute Value Models

32.4.2.1 Dividend Discount Model

32.4.2.2 Free Cash Flow (FCF)

32.4.2.3 Discounted Cash Flow Model

32.4.2.4 Discounted Abnormal Operating Earnings valuation model

32.4.2.5 Net Asset Value Method or Cost Method

32.4.2.6 Excess Earnings Method

32.4.3 Relative Value Models

32.4.3.1 The Idea behind Relative Value Models

32.4.3.2 Some Ratios that can be used in relative value models

32.4.3.3 Measures Related to Company Value for External Stakeholders

32.4.3.4 Relative Value Models in Practice

32.4.3.5 Conclusions and Use

32.4.4 Selection of Valuation Methods

32.4.5 Pitfalls and Matters Requiring Attention for all Methods

32.4.5.1 Results and Sensitivity

32.5 Forwards and Futures

32.6 Options

32.6.1 Definitions

32.6.2 Commercial Aspects

32.6.3 Historic observations

32.6.4 Valuation of Options at Maturity

32.6.5 The Put-Call Parity

32.6.6 The Black & Scholes Model

32.6.6.1 Apply the Black and Scholes formula

32.6.7 Dependencies

32.6.8 Sensitivities: “the Greeks”

32.6.9 Delta Hedging

32.6.10 Linear Option Strategies

32.6.10.1 The Limits of the Black and Scholes Model

32.6.11 The Binomial Model

32.6.11.1 Risk Neutral Method

32.6.11.2 The Equivalent Portfolio Binomial Model

32.6.11.3 Summary Binomial Model

32.6.12 Exotic Options

32.6.13 Integrated Option Strategies

32.6.14 Capital Protected Structures

VIII Report

33 ggplot2

34 R-markdown

35 knitr and LaTeX

36 An automated development cycle

37 Writing and communication skills

38 Interactive apps

38.1 Shiny

38.2 Browser born data visualization

38.2.1 HTML-widgets

38.2.2 ggvis

38.2.3 googleVis

38.3 Dashboards

38.3.1 The business case: a diversity dashboard

38.3.2 A dashboard with flexdashboard

38.3.2.1 Interactive dashboards with flexdashboard

38.3.3 A dashboard with shinydashboard

IX Appendices

39 Other Resources

40 Levels of Measurement

40.1 Nominal Scale

40.2 Ordinal Scale

40.3 Interval Scale

40.4 Ratio Scale

41 Trademark Notices

42 Code snippets not shown in the body of the book

43 Answers to questions

Bibliography

Index

Nomenclature
