Data Analytics with Hadoop: An Introduction for Data Scientists

by Benjamin Bengfort, Jenny Kim


Ready to use statistical and machine-learning techniques across large data sets? This practical guide shows you why the Hadoop ecosystem is perfect for the job. Instead of the deployment, operations, or software development usually associated with distributed computing, you’ll focus on the particular analyses you can build, the data warehousing techniques that Hadoop provides, and the higher-order data workflows this framework can produce.

Data scientists and analysts will learn how to perform a wide range of techniques, from writing MapReduce and Spark applications with Python to using advanced modeling and data management with Spark MLlib, Hive, and HBase. You’ll also learn about the analytical processes and data systems available to build and empower data products that can handle—and actually require—huge amounts of data.

  • Understand core concepts behind Hadoop and cluster computing
  • Use design patterns and parallel analytical algorithms to create distributed data analysis jobs
  • Learn about data management, mining, and warehousing in a distributed context using Apache Hive and HBase
  • Use Sqoop to ingest data from relational databases and Apache Flume to ingest streaming data
  • Program complex Hadoop and Spark applications with Apache Pig and Spark DataFrames
  • Perform machine learning techniques such as classification, clustering, and collaborative filtering with Spark’s MLlib

Product Details

ISBN-13: 9781491913703
Publisher: O'Reilly Media, Incorporated
Publication date: 06/23/2016
Pages: 288
Product dimensions: 6.90(w) x 9.10(h) x 0.70(d)

About the Author

Benjamin Bengfort is a Data Scientist who lives inside the Beltway but ignores politics (the normal business of DC), favoring technology instead. He is currently working to finish his PhD at the University of Maryland, where he studies machine learning and distributed computing. His lab does have robots (though this field of study is not one he favors) and, much to his chagrin, they seem to constantly arm said robots with knives and tools, presumably to pursue culinary accolades. Having seen a robot attempt to slice a tomato, Benjamin prefers his own adventures in the kitchen, where he specializes in fusion French and Guyanese cuisine as well as BBQ of all types. A professional programmer by trade and a Data Scientist by vocation, Benjamin writes on a diverse range of subjects, from Natural Language Processing to Data Science with Python to analytics with Hadoop and Spark.

Jenny Kim is an experienced big data engineer who works in commercial software efforts as well as in academia. She has significant experience working with large-scale data, machine learning, and Hadoop implementations in production and research environments. Jenny (with Benjamin Bengfort) previously built a large-scale recommender system that used a web crawler to gather ontological information about apparel products and produce recommendations from transactions. Currently, she is working with the Hue team at Cloudera to help build intuitive interfaces for analyzing big data with Hadoop.

Table of Contents

Preface vii

Part I Introduction to Distributed Computing

1 The Age of the Data Product 3

What Is a Data Product? 4

Building Data Products at Scale with Hadoop 5

Leveraging Large Datasets 6

Hadoop for Data Products 7

The Data Science Pipeline and the Hadoop Ecosystem 8

Big Data Workflows 10

Conclusion 11

2 An Operating System for Big Data 13

Basic Concepts 14

Hadoop Architecture 15

A Hadoop Cluster 17



Working with a Distributed File System 22

Basic File System Operations 23

File Permissions in HDFS 25

Other HDFS Interfaces 26

Working with Distributed Computation 27

MapReduce: A Functional Programming Model 28

MapReduce: Implemented on a Cluster 30

Beyond a Map and Reduce: Job Chaining 37

Submitting a MapReduce Job to YARN 38

Conclusion 40

3 A Framework for Python and Hadoop Streaming 41

Hadoop Streaming 42

Computing on CSV Data with Streaming 45

Executing Streaming Jobs 50

A Framework for MapReduce with Python 52

Counting Bigrams 55

Other Frameworks 59

Advanced MapReduce 60

Combiners 60

Partitioners 61

Job Chaining 62

Conclusion 65

4 In-Memory Computing with Spark 67

Spark Basics 68

The Spark Stack 70

Resilient Distributed Datasets 72

Programming with RDDs 73

Interactive Spark Using PySpark 77

Writing Spark Applications 79

Visualizing Airline Delays with Spark 81

Conclusion 87

5 Distributed Analysis and Patterns 89

Computing with Keys 91

Compound Keys 92

Keyspace Patterns 96

Pairs versus Stripes 100

Design Patterns 104

Summarization 105

Indexing 110

Filtering 117

Toward Last-Mile Analytics 123

Fitting a Model 124

Validating Models 125

Conclusion 127

Part II Workflows and Tools for Big Data Science

6 Data Mining and Warehousing 131

Structured Data Queries with Hive 132

The Hive Command-Line Interface (CLI) 133

Hive Query Language (HQL) 134

Data Analysis with Hive 139

HBase 144

NoSQL and Column-Oriented Databases 145

Real-Time Analytics with HBase 148

Conclusion 156

7 Data Ingestion 157

Importing Relational Data with Sqoop 158

Importing from MySQL to HDFS 159

Importing from MySQL to Hive 161

Importing from MySQL to HBase 163

Ingesting Streaming Data with Flume 165

Flume Data Flows 166

Ingesting Product Impression Data with Flume 169

Conclusion 173

8 Analytics with Higher-Level APIs 175

Pig 175

Pig Latin 177

Data Types 181

Relational Operators 182

User-Defined Functions 182

Wrapping Up 184

Spark's Higher-Level APIs 184

Spark SQL 186

DataFrames 189

Conclusion 195

9 Machine Learning 197

Scalable Machine Learning with Spark 197

Collaborative Filtering 199

Classification 206

Clustering 208

Conclusion 212

10 Summary: Doing Distributed Data Science 213

Data Product Lifecycle 214

Data Lakes 216

Data Ingestion 218

Computational Data Stores 220

Machine Learning Lifecycle 222

Conclusion 224

A Creating a Hadoop Pseudo-Distributed Development Environment 227

B Installing Hadoop Ecosystem Products 237

Glossary 247

Index 263
