Moving Hadoop to the Cloud: Harnessing Cloud Features and Flexibility for Hadoop Clusters

by Bill Havanki



Until recently, Hadoop deployments existed on hardware owned and run by organizations. Now, of course, you can acquire the computing resources and network connectivity to run Hadoop clusters in the cloud. But there’s a lot more to deploying Hadoop to the public cloud than simply renting machines.

This hands-on guide shows developers and systems administrators familiar with Hadoop how to install, use, and manage cloud-born clusters efficiently. You’ll learn how to architect clusters that work with cloud-provider features—not just to avoid pitfalls, but also to take full advantage of these services. You’ll also compare the Amazon, Google, and Microsoft clouds, and learn how to set up clusters in each of them.

  • Learn how Hadoop clusters run in the cloud, the problems they can help you solve, and their potential drawbacks
  • Examine the common concepts of cloud providers, including compute capabilities, networking and security, and storage
  • Build a functional Hadoop cluster on cloud infrastructure, and learn what the major providers require
  • Explore use cases for high availability, relational data with Hive, and complex analytics with Spark
  • Get patterns and practices for running cloud clusters, from designing for price and security to dealing with maintenance

Product Details

ISBN-13: 9781491959633
Publisher: O'Reilly Media, Incorporated
Publication date: 07/31/2017
Pages: 334
Product dimensions: 6.90(w) x 9.10(h) x 0.80(d)

About the Author

Bill Havanki is a software engineer working for Cloudera, where he has contributed to Hadoop components as well as systems for deploying Hadoop clusters into public cloud services. Prior to joining Cloudera, he worked for 15 years developing software for government contracts, focusing mostly on analytic frameworks and on authentication and authorization systems. He earned his B.S. in Electrical Engineering from Rutgers University and his M.S. in Computer Engineering from North Carolina State University. A New Jersey native, he currently lives near Annapolis, Maryland, with his family.

Table of Contents

Foreword xi

Preface xiii

Part I Introduction to the Cloud

1 Why Hadoop in the Cloud? 3

What Is the Cloud? 3

What Does Hadoop in the Cloud Mean? 4

Reasons to Run Hadoop in the Cloud 5

Reasons to Not Run Hadoop in the Cloud 7

What About Security? 7

Hybrid Clouds 8

Hadoop Solutions from Cloud Providers 8

Elastic MapReduce 9

Google Cloud Dataproc 10

HDInsight 10

Hadoop-Like Services 10

A Spectrum of Choices 10

Getting Started 11

2 Overview and Comparison of Cloud Providers 13

Amazon Web Services 13

References 14

Google Cloud Platform 14

References 15

Microsoft Azure 15

References 16

Which One Should You Use? 16

Part II Cloud Primer

3 Instances 21

Instance Types 22

Regions and Availability Zones 23

Instance Control 24

Temporary Instances 25

Spot Instances 26

Preemptible Instances 26

Images 27

No Instance Is an Island 27

4 Networking and Security 29

A Drink of CIDR 29

Virtual Networks 30

Private DNS 32

Public IP Addresses and DNS 32

Virtual Networks and Regions 33

Routing 34

Routing in AWS 36

Routing in Google Cloud Platform 37

Routing in Azure 37

Network Security Rules 38

Inbound Versus Outbound 38

Allow Versus Deny 38

Network Security Rules in AWS 39

Network Security Rules in Google Cloud Platform 42

Network Security Rules in Azure 44

Putting Networking and Security Together 45

What About the Data? 46

5 Storage 47

Block Storage 47

Block Storage in AWS 48

Block Storage in Google Cloud Platform 48

Block Storage in Azure 49

Object Storage 49

Buckets 50

Data Objects 51

Object Access 51

Object Storage in AWS 52

Object Storage in Google Cloud Platform 53

Object Storage in Azure 53

Cloud Relational Databases 55

Cloud Relational Databases in AWS 56

Cloud Relational Databases in Google Cloud Platform 56

Cloud Relational Databases in Azure 57

Cloud NoSQL Databases 58

Where to Start? 59

Part III A Simple Cluster in the Cloud

6 Setting Up in AWS 63

Prerequisites 63

Allocating Instances 65

Generating a Key Pair 65

Launching Instances 65

Securing the Instances 71

Next Steps 71

7 Setting Up in Google Cloud Platform 73

Prerequisites 73

Creating a Project 74

Allocating Instances 75

SSH Keys 75

Creating Instances 77

Securing the Instances 84

Next Steps 85

8 Setting Up in Azure 87

Prerequisites 87

Creating a Resource Group 89

Creating Resources 90

SSH Keys 96

Creating Virtual Machines 96

The Manager Instance 96

The Worker Instances 103

Next Steps 103

9 Standing Up a Cluster 105

The JDK 105

Hadoop Accounts 106

Passwordless SSH 106

Hadoop Installation 107

HDFS and YARN Configuration 108

The Environment 108

XML Configuration Files 110

Finishing Up Configuration 112

Startup 112

SSH Tunneling 112

Running a Test Job 113

What If the Job Hangs? 114

Running Basic Data Loading and Analysis 115

Wikipedia Exports 115

Analyzing a Small Export 115

Go Bigger 121

Part IV Enhancing Your Cluster

10 High Availability 125

Planning HA in the Cloud 126



Installing and Configuring ZooKeeper 128

Adding New HDFS and YARN Daemons 130

The Second Manager 130

HDFS HA Configuration 131

YARN HA Configuration 135

Testing HA 137

Improving the HA Configuration 138

A Bigger Cluster 139

Complete HA 139

A Third Availability Zone? 139

Benchmarking HA 139

MRBench 140

Terasort 141

Grains of Salt 142

11 Relational Data with Apache Hive 145

Planning for Hive in the Cloud 145

Installing and Configuring Hive 146

Startup 149

Running Some Test Hive Queries 149

Switching to a Remote Metastore 150

The Remote Metastore and Stopped Clusters 157

Hive Control Scripts 158

Hive on S3 158

Configuring the S3 Filesystem 158

Adding Data to S3 159

Configuring S3 Authentication 161

Configuring the S3 Endpoint 164

External Table in S3 164

What About Google Cloud Platform and Azure? 165

A Step Toward Transient Clusters 165

A Different Means of Computation 166

12 Streaming in the Cloud with Apache Spark 167

Planning for Spark in the Cloud 167

Installing and Configuring Spark 168

Startup 170

Running Some Test Jobs 170

Configuring Hive on Spark 171

Add Spark Libraries to Hive 171

Configure Hive for Spark 172

Switch YARN to the Fair Scheduler 172

Try Out Hive on Spark on YARN 172

Spark Streaming from AWS Kinesis 174

Creating a Kinesis Stream 174

Populating the Stream with Data 176

Streaming Kinesis Data into Spark 178

What About Google Cloud Platform and Azure? 183

Building Clusters Versus Building Clusters Well 183

Part V Care and Feeding of Hadoop in the Cloud

13 Pricing and Performance 187

Picking Instance Types 187

The Criteria 188

General Cluster Instance Roles 188

Persistent Versus Ephemeral Block Storage 190

Stopping and Starting Entire Clusters 192

Using Temporary Instances 193

Geographic Considerations 195

Regions 195

Availability Zones 195

Performance and Networking 196

14 Network Topologies 197

Public and Private Subnets 197

SSH Tunneling 199

SOCKS Proxy 201

VPN Access 203

Access from Other Subnets 204

Cluster Topologies 204

The Public Cluster 204

The Secured Public Cluster 205

Gateway Instances 207

The Private Cluster 208

Cluster Access to the Internet and Cloud Provider Services 209

Geographic Considerations 211

Regions 211

Availability Zones 211

Starting Topologies 213

Higher-Level Planning 213

15 Patterns for Cluster Usage 215

Long-Running or Transient? 215

Single-User or Multitenant? 218

Self-Service or Managed? 219

Cloud-Only or Hybrid? 220

Watching Cost 221

The Rising Need for Automation 222

16 Using Images for Cluster Management 223

The Structure of an Image 224

EC2 Images 224

GCE Images 225

Azure Images 225

Image Preparation 225

Wait, I'm Using That! 227

Image Creation 228

Image Creation in AWS 228

Image Creation in Google Cloud Platform 229

Image Creation in Azure 231

Image Use 231

Scripting Hadoop Configuration 232

Image Maintenance 232

Image Deletion 233

Image Deletion in AWS 233

Image Deletion in Google Cloud Platform 233

Image Deletion in Azure 234

Automated Image Creation with Packer 234

Automated Cloud Cluster Creation 236

Cloudera Director 236

Hortonworks Data Cloud 236

Qubole Data Service 237

General System Management Tools 237

Images or Tools? 238

More Tooling 238

17 Monitoring and Automation 239

Monitoring Choices 239

Cloud Provider Monitoring Services 240

Rolling Your Own 241

Cloud Provider Command-Line Interfaces 241


Google Cloud Platform CLI 242

Azure CLI 244

Data Formatting for CLI Results 245

What to Monitor 245

Instance Existence 245

Instance Reachability 246

Hadoop Daemon Status 248

System Load 250

Putting Scripting to Use 253

Custom Metrics in CloudWatch 253

Basic Metrics 253

Defining a Custom Metric 254

Feeding Custom Metric Data to CloudWatch 256

Setting an Alarm on a Custom Metric 258

Elastic Compute Using a Custom Metric 260

A Custom Metric for Compute Capacity 260

Prerequisites for Autoscaling Compute 262

Triggering Autoscaling with an Alarm Action 262

What About Shrinking? 263

Other Things to Watch 264

Ingesting Logs into CloudWatch 264

Creating an IAM User for Log Streaming 264

Installing the CloudWatch Agent 266

Creating a Metric Filter 267

Creating an Alarm from a Metric Filter 269

So Much More to See and Do 272

18 Backup and Restoration 273

Patterns to Supplement Backups 273

Backup via Imaging 274

HDFS Replication 275

Cloud Storage Filesystems 276

HDFS Snapshots 277

Hive Metastore Replication 277

Logs 278

A General Cloud Hadoop Backup Strategy 279

Not So Different, But Better 279

To the Cloud 280

A Hadoop Component Start and Stop Scripts 281

B Hadoop Cluster Configuration Scripts 285

C Monitoring Cloud Clusters with Nagios 297

Index 305
