Data Science at the Command Line: Obtain, Scrub, Explore, and Model Data with Unix Power Tools

This thoroughly revised guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You'll learn how to combine small yet powerful command-line tools to quickly obtain, scrub, explore, and model your data. To get you started, author Jeroen Janssens provides a Docker image packed with over 100 Unix power tools--useful whether you work with Windows, macOS, or Linux.

You'll quickly discover why the command line is an agile, scalable, and extensible technology. Even if you're comfortable processing data with Python or R, you'll learn how to greatly improve your data science workflow by leveraging the command line's power. This book is ideal for data scientists, analysts, engineers, system administrators, and researchers.

  • Obtain data from websites, APIs, databases, and spreadsheets
  • Perform scrub operations on text, CSV, HTML, XML, and JSON files
  • Explore data, compute descriptive statistics, and create visualizations
  • Manage your data science workflow
  • Create your own tools from one-liners and existing Python or R code
  • Parallelize and distribute data-intensive pipelines
  • Model data with dimensionality reduction, regression, and classification algorithms
  • Leverage the command line from Python, Jupyter, R, RStudio, and Apache Spark

by Jeroen Janssens

eBook

$56.99 

Product Details

ISBN-13: 9781492087861
Publisher: O'Reilly Media, Incorporated
Publication date: 08/17/2021
Sold by: Barnes & Noble
Format: eBook
Pages: 282
File size: 5 MB

About the Author

Jeroen Janssens teaches data science: often through training and coaching, occasionally through speaking, and infrequently through writing. His interests include visualizing data, building machine learning models, and automating things using Python, R, or Bash. He is the author of Data Science at the Command Line, published by O’Reilly Media. Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University. Previously, he was an assistant professor at Jheronimus Academy of Data Science and a data scientist at Elsevier in Amsterdam and at various startups in New York City. Currently, Jeroen is the CEO of Data Science Workshops, which organises open enrollment workshops, in-company courses, inspiration sessions, hackathons, and meetups, all related to data science, of course. He lives with his wife and two kids in Rotterdam, the Netherlands.

Table of Contents

Foreword xiii

Preface xv

1 Introduction 1

Data Science Is OSEMN 2

Obtaining Data 3

Scrubbing Data 3

Exploring Data 3

Modeling Data 4

Interpreting Data 4

Intermezzo Chapters 4

What Is the Command Line? 5

Why Data Science at the Command Line? 7

The Command Line Is Agile 7

The Command Line Is Augmenting 8

The Command Line Is Scalable 8

The Command Line Is Extensible 9

The Command Line Is Ubiquitous 9

Summary 10

For Further Exploration 10

2 Getting Started 11

Getting the Data 11

Installing the Docker Image 12

Essential Unix Concepts 13

The Environment 14

Executing a Command-Line Tool 15

Five Types of Command-Line Tools 16

Combining Command-Line Tools 20

Redirecting Input and Output 22

Working with Files and Directories 26

Managing Output 28

Help! 30

Summary 33

For Further Exploration 33

3 Obtaining Data 35

Overview 36

Copying Local Files to the Docker Container 36

Downloading from the Internet 37

Introducing curl 37

Saving 38

Other Protocols 39

Following Redirects 39

Decompressing Files 41

Converting Microsoft Excel Spreadsheets to CSV 43

Querying Relational Databases 46

Calling Web APIs 47

Authentication 48

Streaming APIs 50

Summary 52

For Further Exploration 52

4 Creating Command-Line Tools 53

Overview 54

Converting One-Liners into Shell Scripts 55

Step 1 Create a File 58

Step 2 Give Permission to Execute 61

Step 3 Define a Shebang 62

Step 4 Remove the Fixed Input 65

Step 5 Add Arguments 66

Step 6 Extend Your PATH 68

Creating Command-Line Tools with Python and R 69

Porting the Shell Script 70

Processing Streaming Data from Standard Input 72

Summary 74

For Further Exploration 74

5 Scrubbing Data 77

Overview 78

Transformations, Transformations Everywhere 78

Plain Text 81

Filtering Lines 81

Extracting Values 86

Replacing and Deleting Values 88

CSV 90

Bodies and Headers and Columns, Oh My! 90

Performing SQL Queries on CSV 93

Extracting and Reordering Columns 94

Filtering Rows 95

Merging Columns 96

Combining Multiple CSV Files 99

Working with XML/HTML and JSON 101

Summary 104

For Further Exploration 105

6 Project Management with Make 107

Overview 108

Introducing Make 109

Running Tasks 109

Building, for Real 112

Adding Dependencies 113

Summary 118

For Further Exploration 118

7 Exploring Data 119

Overview 120

Inspecting Data and Its Properties 120

Header or Not, Here I Come 120

Inspect All the Data 121

Feature Names and Data Types 122

Unique Identifiers, Continuous Variables, and Factors 124

Computing Descriptive Statistics 126

Column Statistics 126

R One-Liners on the Shell 129

Creating Visualizations 133

Displaying Images from the Command Line 133

Plotting in a Rush 138

Creating Bar Charts 140

Creating Histograms 142

Creating Density Plots 143

Happy Little Accidents 144

Creating Scatter Plots 146

Creating Trend Lines 147

Creating Box Plots 149

Adding Labels 150

Going Beyond Basic Plots 152

Summary 152

For Further Exploration 152

8 Parallel Pipelines 153

Overview 154

Serial Processing 154

Looping Over Numbers 155

Looping Over Lines 156

Looping Over Files 157

Parallel Processing 158

Introducing GNU Parallel 160

Specifying Input 162

Controlling the Number of Concurrent Jobs 164

Logging and Output 164

Creating Parallel Tools 166

Distributed Processing 167

Get List of Running AWS EC2 Instances 167

Running Commands on Remote Machines 169

Distributing Local Data Among Remote Machines 170

Processing Files on Remote Machines 171

Summary 174

For Further Exploration 175

9 Modeling Data 177

Overview 178

More Wine, Please! 178

Dimensionality Reduction with Tapkee 182

Introducing Tapkee 183

Linear and Nonlinear Mappings 183

Regression with Vowpal Wabbit 187

Preparing the Data 187

Training the Model 188

Testing the Model 190

Classification with SciKit-Learn Laboratory 193

Preparing the Data 193

Running the Experiment 194

Parsing the Results 195

Summary 197

For Further Exploration 198

10 Polyglot Data Science 199

Overview 200

Jupyter 200

Python 203

R 205

RStudio 207

Apache Spark 208

Summary 210

For Further Exploration 211

11 Conclusion 213

Let's Recap 213

Three Pieces of Advice 214

Be Patient 214

Be Creative 215

Be Practical 215

Where to Go from Here 215

The Command Line 216

Shell Programming 216

Python, R, and SQL 216

APIs 216

Machine Learning 217

Getting in Touch 217

List of Command-Line Tools 219

Index 249
