Table of Contents
Preface ix
1 Design Patterns and MapReduce 1
Design Patterns 2
MapReduce History 4
MapReduce and Hadoop Refresher 4
Hadoop Example: Word Count 7
Pig and Hive 11
2 Summarization Patterns 13
Numerical Summarizations 14
Pattern Description 14
Numerical Summarization Examples 17
Inverted Index Summarizations 32
Pattern Description 32
Inverted Index Example 35
Counting with Counters 37
Pattern Description 37
Counting with Counters Example 40
3 Filtering Patterns 43
Filtering 44
Pattern Description 44
Filtering Examples 47
Bloom Filtering 49
Pattern Description 49
Bloom Filtering Examples 53
Top Ten 58
Pattern Description 58
Top Ten Examples 63
Distinct 65
Pattern Description 65
Distinct Examples 68
4 Data Organization Patterns 71
Structured to Hierarchical 72
Pattern Description 72
Structured to Hierarchical Examples 76
Partitioning 82
Pattern Description 82
Partitioning Examples 86
Binning 88
Pattern Description 88
Binning Examples 90
Total Order Sorting 92
Pattern Description 92
Total Order Sorting Examples 95
Shuffling 99
Pattern Description 99
Shuffle Examples 101
5 Join Patterns 103
A Refresher on Joins 104
Reduce Side Join 108
Pattern Description 108
Reduce Side Join Example 111
Reduce Side Join with Bloom Filter 117
Replicated Join 119
Pattern Description 119
Replicated Join Examples 121
Composite Join 123
Pattern Description 123
Composite Join Examples 126
Cartesian Product 128
Pattern Description 128
Cartesian Product Examples 132
6 Metapatterns 139
Job Chaining 139
With the Driver 140
Job Chaining Examples 141
With Shell Scripting 150
With JobControl 153
Chain Folding 158
The ChainMapper and ChainReducer Approach 163
Chain Folding Example 163
Job Merging 168
Job Merging Examples 170
7 Input and Output Patterns 177
Customizing Input and Output in Hadoop 177
InputFormat 178
RecordReader 179
OutputFormat 180
RecordWriter 181
Generating Data 182
Pattern Description 182
Generating Data Examples 184
External Source Output 189
Pattern Description 189
External Source Output Example 191
External Source Input 195
Pattern Description 195
External Source Input Example 197
Partition Pruning 202
Pattern Description 202
Partition Pruning Examples 205
8 Final Thoughts and the Future of Design Patterns 217
Trends in the Nature of Data 217
Images, Audio, and Video 217
Streaming Data 218
The Effects of YARN 219
Patterns as a Library or Component 220
How You Can Help 220
A. Bloom Filters 221
Index 227