Think Like a Data Scientist: Tackle the data science process step-by-step

by Brian Godsey



Think Like a Data Scientist presents a step-by-step approach to data science, combining analytic, programming, and business perspectives into easy-to-digest techniques and thought processes for solving real-world data-centric problems.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

About the Technology

Data collected from customers, scientific measurements, IoT sensors, and so on is valuable only if you understand it. Data scientists revel in the interesting and rewarding challenge of observing, exploring, analyzing, and interpreting this data. Getting started with data science means more than mastering analytic tools and techniques, however; the real magic happens when you begin to think like a data scientist. This book will get you there.

About the Book

Think Like a Data Scientist teaches you a step-by-step approach to solving real-world data-centric problems. By breaking down carefully crafted examples, you'll learn to combine analytic, programming, and business perspectives into a repeatable process for extracting real knowledge from data. As you read, you'll discover (or remember) valuable statistical techniques and explore powerful data science software. More importantly, you'll put this knowledge together using a structured process for data science. When you've finished, you'll have a strong foundation for a lifetime of data science learning and practice.

What's Inside

  • The data science process, step-by-step
  • How to anticipate problems
  • Dealing with uncertainty
  • Best practices in software and scientific thinking

About the Reader

Readers need beginner programming skills and knowledge of basic statistics.

About the Author

Brian Godsey has worked in software, academia, finance, and defense and has launched several data-centric start-ups.

Table of Contents


  1. Philosophies of data science
  2. Setting goals by asking good questions
  3. Data all around us: the virtual wilderness
  4. Data wrangling: from capture to domestication
  5. Data assessment: poking and prodding

  6. Developing a plan
  7. Statistics and modeling: concepts and foundations
  8. Software: statistics in action
  9. Supplementary software: bigger, faster, more efficient
  10. Plan execution: putting it all together

  11. Delivering a product
  12. After product delivery: problems and revisions
  13. Wrapping up: putting the project away

Product Details

ISBN-13: 9781633430273
Publisher: Manning Publications Company
Publication date: 03/31/2017
Pages: 328
Sales rank: 375,381
Product dimensions: 7.30(w) x 9.10(h) x 0.90(d)

About the Author

Brian Godsey holds a PhD in applied mathematics, is active in the academic community, and has been developing statistical software for over 10 years. In the last few years, he has been involved in startups as a co-founder, adviser, and team member.

Table of Contents

Preface xv

Acknowledgments xvi

About this book xvii

About the cover illustration xxi

Part 1 Preparing and Gathering Data and Knowledge 1

1 Philosophies of data science 3

1.1 Data science and this book 5

1.2 Awareness is valuable 7

1.3 Developer vs. data scientist 8

1.4 Do I need to be a software developer? 10

1.5 Do I need to know statistics? 11

1.6 Priorities: knowledge first, technology second, opinions third 12

1.7 Best practices 13

Documentation 14

Code repositories and versioning 14

Code organization 15

Ask questions 16

Stay close to the data 17

1.8 Reading this book: how I discuss concepts 17

2 Setting goals by asking good questions 19

2.1 Listening to the customer 20

Resolving wishes and pragmatism 20

The customer is probably not a data scientist 22

Asking specific questions to uncover facts, not opinions 23

Suggesting deliverables: guess and check 24

Iterate your ideas based on knowledge, not wishes 25

2.2 Ask good questions of the data 26

Good questions are concrete in their assumptions 27

Good answers: measurable success without too much cost 29

2.3 Answering the question using data 30

Is the data relevant and sufficient? 31

Has someone done this before? 32

Figuring out what data and software you could use 32

Anticipate obstacles to getting everything you want 33

2.4 Setting goals 34

What is possible? 34

What is valuable? 34

What is efficient? 35

2.5 Planning: be flexible 35

3 Data all around us: the virtual wilderness 37

3.1 Data as the object of study 37

The users of computers and the internet became data generators 38

Data for its own sake 40

Data scientist as explorer 41

3.2 Where data might live, and how to interact with it 44

Flat files 45


XML 48


Relational databases 50

Non-relational databases 52

APIs 52

Common bad formats 54

Unusual formats 55

Deciding which format to use 55

3.3 Scouting for data 56

First step: Google search 57

Copyright and licensing 57

The data you have: is it enough? 58

Combining data sources 59

Web scraping 60

Measuring or collecting things yourself 61

3.4 Example: microRNA and gene expression 62

4 Data wrangling: from capture to domestication 67

4.1 Case study: best all-time performances in track and field 68

Common heuristic comparisons 68

IAAF Scoring Tables 69

Comparing performances using all data available 70

4.2 Getting ready to wrangle 70

Some types of messy data 71

Pretend you're an algorithm 71

Keep imagining: what are the possible obstacles and uncertainties? 73

Look at the end of the data and the file 74

Make a plan 75

4.3 Techniques and tools 76

File format converters 76

Proprietary data wranglers 77

Scripting: use the plan, but then guess and check 77

4.4 Common pitfalls 78

Watch out for Windows/Mac/Linux problems 78

Escape characters 79

The outliers 82

Horror stories around the wranglers' campfire 82

5 Data assessment: poking and prodding 84

5.1 Example: the Enron email data set 85

5.2 Descriptive statistics 86

Stay close to the data 87

Common descriptive statistics 88

Choosing specific statistics to calculate 89

Make tables or graphs where appropriate 91

5.3 Check assumptions about the data 92

Assumptions about the contents of the data 92

Assumptions about the distribution of the data 92

A handy trick for uncovering your assumptions 94

5.4 Looking for something specific 95

Find a few examples 95

Characterize the examples: what makes them different? 96

Data snooping (or not) 98

5.5 Rough statistical analysis 99

Dumb it down 99

Take a subset of the data 102

Increasing sophistication: does it improve results? 103

Part 2 Building a Product with Software and Statistics 105

6 Developing a plan 107

6.1 What have you learned? 109

Examples 109

Evaluating what you've learned 112

6.2 Reconsidering expectations and goals 113

Unexpected new information 114

Adjusting goals 116

Consider more exploratory work 117

6.3 Planning 117

Examples 118

6.4 Communicating new goals 127

7 Statistics and modeling: concepts and foundations 129

7.1 How I think about statistics 130

7.2 Statistics: the field as it relates to data science 131

What statistics is 131

What statistics is not 132

7.3 Mathematics 134

Example: long division 134

Mathematical models 137

Mathematics vs. statistics 140

7.4 Statistical modeling and inference 141

Defining a statistical model 142

Latent variables 143

Quantifying uncertainty: randomness, variance, and error terms 144

Fitting a model 148

Bayesian vs. frequentist statistics 153

Drawing conclusions from models 156

7.5 Miscellaneous statistical methods 159

Clustering 159

Component analysis 160

Machine learning and black box methods 162

8 Software: statistics in action 166

8.1 Spreadsheets and GUI-based applications 167

Spreadsheets 167

Other GUI-based statistical applications 171

Data science for the masses 172

8.2 Programming 173

Getting started with programming 174

Languages 182

8.3 Choosing statistical software tools 191

Does the tool have an implementation of the methods? 191

Flexibility is good 192

Informative is good 192

Common is good 192

Well documented is good 193

Purpose-built is good 193

Interoperability is good 194

Permissive licenses are good 194

Knowledge and familiarity are good 195

8.4 Translating statistics into software 195

Using built-in methods 195

Writing your own methods 199

9 Supplementary software: bigger, faster, more efficient 201

9.1 Databases 202

Types of databases 202

Benefits of databases 204

How to use databases 206

When to use databases 206

9.2 High-performance computing 207

Types of HPC 207

Benefits of HPC 208

How to use HPC 208

When to use HPC 209

9.3 Cloud services 209

Types of cloud services 209

Benefits of cloud services 210

How to use cloud services 210

When to use cloud services 210

9.4 Big data technologies 211

Types of big data technologies 212

Benefits of big data technologies 213

How to use big data technologies 213

When to use big data technologies 213

9.5 Anything as a service 213

10 Plan execution: putting it all together 215

10.1 Tips for executing the plan 216

If you're a statistician 216

If you're a software engineer 218

If you're a beginner 219

If you're a member of a team 219

If you're leading a team 220

10.2 Modifying the plan in progress 221

Sometimes the goals change 222

Something might be more difficult than you thought 222

Sometimes you realize you made a bad choice 223

10.3 Results: knowing when they're good enough 223

Statistical significance 223

Practical usefulness 224

Reevaluating your original accuracy and significance goals 225

10.4 Case study: protocols for measurement of gene activity 227

The project 227

What I knew 228

What I needed to learn 228

The resources 229

The statistical model 229

The software 232

The plan 232

The results 233

Submitting for publication and feedback 234

How it ended 235

Part 3 Finishing Off the Product and Wrapping Up 237

11 Delivering a product 239

11.1 Understanding your customer 240

Who is the entire audience for the results? 240

What will be done with the results? 241

11.2 Delivery media 242

Report or white paper 242

Analytical tool 243

Interactive graphical application 245

Instructions for how to redo the analysis 247

Other types of products 248

11.3 Content 249

Make important, conclusive results prominent 249

Don't include results that are virtually inconclusive 249

Include obvious disclaimers for less significant results 250

User experience 250

11.4 Example: analyzing video game play 253

12 After product delivery: problems and revisions 256

12.1 Problems with the product and its use 257

Customers not using the product correctly 257

UX problems 259

Software bugs 261

The product doesn't solve real problems 262

12.2 Feedback 264

Feedback means someone is using your product 264

Feedback is not disapproval 264

Read between the lines 265

Ask for feedback if you must 267

12.3 Product revisions 268

Uncertainty can make revisions necessary 268

Designing revisions 269

Engineering revisions 270

Deciding which revisions to make 272

13 Wrapping up: putting the project away 274

13.1 Putting the project away neatly 275

Documentation 276

Stowage 278

Thinking ahead to future scenarios 281

Best practices 283

13.2 Learning from the project 284

Project postmortem 284

13.3 Looking toward the future 287

Exercises: examples and answers 290

Index 299
