MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems

by Donald Miner, Adam Shook

Overview

Until now, design patterns for the MapReduce framework have been scattered among various research papers, blogs, and books. This handy guide brings together a unique collection of valuable MapReduce patterns that will save you time and effort regardless of the domain, language, or development framework you’re using.

Each pattern is explained in context, with pitfalls and caveats clearly identified to help you avoid common design mistakes when modeling your big data architecture. This book also provides a complete overview of MapReduce that explains its origins and implementations, and why design patterns are so important. All code examples are written for Hadoop.

  • Summarization patterns: get a top-level view by summarizing and grouping data
  • Filtering patterns: view data subsets such as records generated from one user
  • Data organization patterns: reorganize data to work with other systems, or to make MapReduce analysis easier
  • Join patterns: analyze different datasets together to discover interesting relationships
  • Metapatterns: piece together several patterns to solve multi-stage problems, or to perform several analytics in the same job
  • Input and output patterns: customize the way you use Hadoop to load or store data

"A clear exposition of MapReduce programs for common data processing patterns—this book is indispensable for anyone using Hadoop."

--Tom White, author of Hadoop: The Definitive Guide


Product Details

ISBN-13: 9781449341985
Publisher: O'Reilly Media, Incorporated
Publication date: 11/21/2012
Sold by: Barnes & Noble
Format: eBook
Pages: 250
File size: 5 MB

About the Author

Donald Miner serves as a Solutions Architect at EMC Greenplum, advising and helping customers implement and use Greenplum's big data systems. Prior to working with Greenplum, Dr. Miner architected several large-scale and mission-critical Hadoop deployments with the U.S. Government as a contractor. He is also involved in teaching, having previously instructed industry classes on Hadoop and a variety of artificial intelligence courses at the University of Maryland, Baltimore County (UMBC). Dr. Miner received his PhD in Computer Science from UMBC, where his dissertation focused on Machine Learning and Multi-Agent Systems.


Adam Shook is a Software Engineer at ClearEdge IT Solutions, LLC, working with a number of big data technologies such as Hadoop, Accumulo, Pig, and ZooKeeper. Shook graduated with a B.S. in Computer Science from the University of Maryland, Baltimore County (UMBC) and took a job building a new high-performance graphics engine for a game studio. Seeking new challenges, he enrolled in the graduate program at UMBC with a focus on distributed computing technologies. He quickly found development work as a U.S. government contractor on a large-scale Hadoop deployment. Shook is involved in developing and teaching training curricula for both Hadoop and Pig. He spends what little free time he has working on side projects and playing video games.

Table of Contents

Preface

1 Design Patterns and MapReduce

Design Patterns

MapReduce History

MapReduce and Hadoop Refresher

Hadoop Example: Word Count

Pig and Hive

2 Summarization Patterns

Numerical Summarizations

Pattern Description

Numerical Summarization Examples

Inverted Index Summarizations

Pattern Description

Inverted Index Example

Counting with Counters

Pattern Description

Counting with Counters Example

3 Filtering Patterns

Filtering

Pattern Description

Filtering Examples

Bloom Filtering

Pattern Description

Bloom Filtering Examples

Top Ten

Pattern Description

Top Ten Examples

Distinct

Pattern Description

Distinct Examples

4 Data Organization Patterns

Structured to Hierarchical

Pattern Description

Structured to Hierarchical Examples

Partitioning

Pattern Description

Partitioning Examples

Binning

Pattern Description

Binning Examples

Total Order Sorting

Pattern Description

Total Order Sorting Examples

Shuffling

Pattern Description

Shuffle Examples

5 Join Patterns

A Refresher on Joins

Reduce Side Join

Pattern Description

Reduce Side Join Example

Reduce Side Join with Bloom Filter

Replicated Join

Pattern Description

Replicated Join Examples

Composite Join

Pattern Description

Composite Join Examples

Cartesian Product

Pattern Description

Cartesian Product Examples

6 Metapatterns

Job Chaining

With the Driver

Job Chaining Examples

With Shell Scripting

With JobControl

Chain Folding

The ChainMapper and ChainReducer Approach

Chain Folding Example

Job Merging

Job Merging Examples

7 Input and Output Patterns

Customizing Input and Output in Hadoop

InputFormat

RecordReader

OutputFormat

RecordWriter

Generating Data

Pattern Description

Generating Data Examples

External Source Output

Pattern Description

External Source Output Example

External Source Input

Pattern Description

External Source Input Example

Partition Pruning

Pattern Description

Partition Pruning Examples

8 Final Thoughts and the Future of Design Patterns

Trends in the Nature of Data

Images, Audio, and Video

Streaming Data

The Effects of Yarn

Patterns as a Library or Component

How You Can Help

A. Bloom Filters

Index
