Architecting HBase Applications: A Guidebook for Successful Development and Design

by Jean-Marc Spaggiari, Kevin O'Dell

Paperback

Overview

HBase is a remarkable tool for indexing mass volumes of data, but getting started with this distributed database and its ecosystem can be daunting. With this hands-on guide, you’ll learn how to architect, design, and deploy your own HBase applications by examining real-world solutions. Along with HBase principles and cluster deployment guidelines, this book includes in-depth case studies that demonstrate how large companies solved specific use cases with HBase.

Authors Jean-Marc Spaggiari and Kevin O’Dell also provide draft solutions and code examples to help you implement your own versions of those use cases, from master data management (MDM) and document storage to near real-time event processing. You’ll also learn troubleshooting techniques to help you avoid common deployment mistakes.

  • Learn exactly what HBase does, what its ecosystem includes, and how to set up your environment
  • Explore how real-world HBase instances were deployed and put into production
  • Examine documented use cases for tracking healthcare claims, digital advertising, data management, and product quality
  • Understand how HBase works with tools and techniques such as Spark, Kafka, MapReduce, and the Java API
  • Learn how to identify the causes and understand the consequences of the most common HBase issues

Product Details

ISBN-13: 9781491915813
Publisher: O'Reilly Media, Incorporated
Publication date: 08/11/2016
Pages: 252
Product dimensions: 7.00(w) x 9.10(h) x 0.50(d)

About the Author

Jean-Marc Spaggiari, an HBase contributor since 2012, is an HBase specialist Solutions Architect at Cloudera, where he supports Hadoop and HBase through technical support and consulting work. He has worked with some of the biggest HBase users in North America.

Jean-Marc’s primary role is to support HBase users with their cluster deployments, upgrades, configuration, and optimization, as well as with HBase-related application development. He is also a very active HBase community member, testing every release for performance and stability.

Prior to Cloudera, Jean-Marc worked as a Project Manager and as a Solution Architect for CGI and for insurance companies. He has almost 20 years of Java development experience. In addition to regularly attending HBaseCon, he has spoken at various Hadoop User Group meetings and many conferences in North America, usually giving HBase-related presentations and demonstrations.

Kevin O’Dell is currently a Field Engineer at Rocana, where he works with customers to architect large-scale IT Operations. Prior to Rocana, Kevin worked at Cloudera for over four years, where he interacted with numerous Fortune 500 companies across every vertical.

In addition to his day-to-day work at Rocana, Kevin works closely with the open source Apache community. He is a contributor to the Apache HBase project, has written numerous blog posts, and has presented at multiple conferences on the Hadoop ecosystem.

Table of Contents

Foreword xi

Preface xiii

Part I Introduction to HBase

1 What Is HBase? 3

Column-Oriented Versus Row-Oriented 5

Implementation and Use Cases 5

2 HBase Principles 7

Table Format 7

Table Layout 8

Table Storage 9

Internal Table Operations 15

Compaction 15

Splits (Auto-Sharding) 17

Balancing 19

Dependencies 19

HBase Roles 20

Master Server 21

Region Server 21

Thrift Server 22

REST Server 22

3 HBase Ecosystem 25

Monitoring Tools 25

Cloudera Manager 26

Apache Ambari 28

Hannibal 32

SQL 33

Apache Phoenix 33

Apache Trafodion 33

Splice Machine 34

Honorable Mentions (Kylin, Themis, Tephra, Hive, and Impala) 34

Frameworks 35

OpenTSDB 35

Kite 36

HappyBase 37

AsyncHBase 37

4 HBase Sizing and Tuning Overview 39

Hardware 40

Storage 40

Networking 41

OS Tuning 42

Hadoop Tuning 43

HBase Tuning 44

Different Workload Tuning 46

5 Environment Setup 49

System Requirements 50

Operating System 50

Virtual Machine 50

Resources 52

Java 53

HBase Standalone Installation 53

HBase in a VM 56

Local Versus VM 57

Local Mode 57

Virtual Linux Environment 58

QuickStart VM (or Equivalent) 58

Troubleshooting 59

IP/Name Configuration 59

Access to the /tmp Folder 59

Environment Variables 59

Available Memory 59

First Steps 60

Basic Operations 61

Import Code Examples 62

Testing the Examples 66

Pseudodistributed and Fully Distributed 68

Part II Use Cases

6 Use Case: HBase as a System of Record 73

Ingest/Pre-Processing 74

Processing/Serving 75

User Experience 79

7 Implementation of an Underlying Storage Engine 83

Table Design 83

Table Schema 84

Table Parameters 85

Implementation 87

Data Conversion 88

Generate Test Data 88

Create Avro Schema 89

Implement MapReduce Transformation 89

HFile Validation 94

Bulk Loading 95

Data Validation 96

Table Size 97

File Content 98

Data Indexing 100

Data Retrieval 104

Going Further 105

8 Use Case: Near Real-Time Event Processing 107

Ingest/Pre-Processing 110

Near Real-Time Event Processing 111

Processing/Serving 112

9 Implementation of Near Real-Time Event Processing 115

Application Flow 117

Kafka 117

Flume 118

HBase 118

Lily 120

Solr 120

Implementation 121

Data Generation 121

Kafka 122

Flume 123

Serializer 130

HBase 134

Lily 136

Solr 138

Testing 139

Going Further 140

10 Use Case: HBase as a Master Data Management Tool 141

Ingest 142

Processing 143

11 Implementation of HBase as a Master Data Management Tool 147

MapReduce Versus Spark 147

Get Spark Interacting with HBase 148

Run Spark over an HBase Table 148

Calling HBase from Spark 148

Implementing Spark with HBase 149

Spark and HBase: Puts 150

Spark on HBase: Bulk Load 154

Spark Over HBase 156

Going Further 160

12 Use Case: Document Store 161

Serving 163

Ingest 164

Clean Up 166

13 Implementation of Document Store 167

MOBs 167

Storage 169

Usage 170

Too Big 170

Consistency 172

Going Further 173

Part III Troubleshooting

14 Too Many Regions 177

Consequences 177

Causes 178

Misconfiguration 178

Misoperation 179

Solution 179

Before 0.98 179

Starting with 0.98 185

Prevention 187

Regions Size 187

Key and Table Design 189

15 Too Many Column Families 191

Consequences 192

Memory 192

Compactions 193

Split 193

Causes, Solution, and Prevention 193

Delete a Column Family 194

Merge a Column Family 194

Separate a Column Family into a New Table 196

16 Hotspotting 197

Consequences 197

Causes 198

Monotonically Incrementing Keys 198

Poorly Distributed Keys 198

Small Reference Tables 199

Applications' Issues 200

Meta Region Hotspotting 200

Prevention and Solution 200

17 Timeouts and Garbage Collection 203

Consequences 203

Causes 206

Storage Failure 206

Power-Saving Features 206

Network Failure 207

Solutions 207

Prevention 207

Reduce Heap Size 208

Off-Heap BlockCache 208

Using the G1GC Algorithm 209

Configure Swappiness to 0 or 1 210

Disable Environment-Friendly Features 211

Hardware Duplication 211

18 HBCK and Inconsistencies 213

HBase Filesystem Layout 213

Reading META 214

Reading HBase on HDFS 215

General HBCK Overview 217

Using HBCK 218

Index 223