
Moving Hadoop to the Cloud: Harnessing Cloud Features and Flexibility for Hadoop Clusters
Paperback
Overview
This hands-on guide shows developers and systems administrators familiar with Hadoop how to install, use, and manage cloud-born clusters efficiently. You’ll learn how to architect clusters that work with cloud-provider features—not just to avoid pitfalls, but also to take full advantage of these services. You’ll also compare the Amazon, Google, and Microsoft clouds, and learn how to set up clusters in each of them.
- Learn how Hadoop clusters run in the cloud, the problems they can help you solve, and their potential drawbacks
- Examine the common concepts of cloud providers, including compute capabilities, networking and security, and storage
- Build a functional Hadoop cluster on cloud infrastructure, and learn what the major providers require
- Explore use cases for high availability, relational data with Hive, and complex analytics with Spark
- Get patterns and practices for running cloud clusters, from designing for price and security to dealing with maintenance
Product Details
| ISBN-13: | 9781491959633 |
|---|---|
| Publisher: | O'Reilly Media, Incorporated |
| Publication date: | 07/31/2017 |
| Pages: | 338 |
| Product dimensions: | 6.90(w) x 9.10(h) x 0.80(d) |
Table of Contents
Foreword xi
Preface xiii
Part I Introduction to the Cloud
1 Why Hadoop in the Cloud? 3
What Is the Cloud? 3
What Does Hadoop in the Cloud Mean? 4
Reasons to Run Hadoop in the Cloud 5
Reasons to Not Run Hadoop in the Cloud 7
What About Security? 7
Hybrid Clouds 8
Hadoop Solutions from Cloud Providers 8
Elastic MapReduce 9
Google Cloud Dataproc 10
HDInsight 10
Hadoop-Like Services 10
A Spectrum of Choices 10
Getting Started 11
2 Overview and Comparison of Cloud Providers 13
Amazon Web Services 13
References 14
Google Cloud Platform 14
References 15
Microsoft Azure 15
References 16
Which One Should You Use? 16
Part II Cloud Primer
3 Instances 21
Instance Types 22
Regions and Availability Zones 23
Instance Control 24
Temporary Instances 25
Spot Instances 26
Preemptible Instances 26
Images 27
No Instance Is an Island 27
4 Networking and Security 29
A Drink of CIDR 29
Virtual Networks 30
Private DNS 32
Public IP Addresses and DNS 32
Virtual Networks and Regions 33
Routing 34
Routing in AWS 36
Routing in Google Cloud Platform 37
Routing in Azure 37
Network Security Rules 38
Inbound Versus Outbound 38
Allow Versus Deny 38
Network Security Rules in AWS 39
Network Security Rules in Google Cloud Platform 42
Network Security Rules in Azure 44
Putting Networking and Security Together 45
What About the Data? 46
5 Storage 47
Block Storage 47
Block Storage in AWS 48
Block Storage in Google Cloud Platform 48
Block Storage in Azure 49
Object Storage 49
Buckets 50
Data Objects 51
Object Access 51
Object Storage in AWS 52
Object Storage in Google Cloud Platform 53
Object Storage in Azure 53
Cloud Relational Databases 55
Cloud Relational Databases in AWS 56
Cloud Relational Databases in Google Cloud Platform 56
Cloud Relational Databases in Azure 57
Cloud NoSQL Databases 58
Where to Start? 59
Part III A Simple Cluster in the Cloud
6 Setting Up in AWS 63
Prerequisites 63
Allocating Instances 65
Generating a Key Pair 65
Launching Instances 65
Securing the Instances 71
Next Steps 71
7 Setting Up in Google Cloud Platform 73
Prerequisites 73
Creating a Project 74
Allocating Instances 75
SSH Keys 75
Creating Instances 77
Securing the Instances 84
Next Steps 85
8 Setting Up in Azure 87
Prerequisites 87
Creating a Resource Group 89
Creating Resources 90
SSH Keys 96
Creating Virtual Machines 96
The Manager Instance 96
The Worker Instances 103
Next Steps 103
9 Standing Up a Cluster 105
The JDK 105
Hadoop Accounts 106
Passwordless SSH 106
Hadoop Installation 107
HDFS and YARN Configuration 108
The Environment 108
XML Configuration Files 110
Finishing Up Configuration 112
Startup 112
SSH Tunneling 112
Running a Test Job 113
What If the Job Hangs? 114
Running Basic Data Loading and Analysis 115
Wikipedia Exports 115
Analyzing a Small Export 115
Go Bigger 121
Part IV Enhancing Your Cluster
10 High Availability 125
Planning HA in the Cloud 126
HDFS HA 126
YARN HA 128
Installing and Configuring ZooKeeper 128
Adding New HDFS and YARN Daemons 130
The Second Manager 130
HDFS HA Configuration 131
YARN HA Configuration 135
Testing HA 137
Improving the HA Configuration 138
A Bigger Cluster 139
Complete HA 139
A Third Availability Zone? 139
Benchmarking HA 139
MRBench 140
Terasort 141
Grains of Salt 142
11 Relational Data with Apache Hive 145
Planning for Hive in the Cloud 145
Installing and Configuring Hive 146
Startup 149
Running Some Test Hive Queries 149
Switching to a Remote Metastore 150
The Remote Metastore and Stopped Clusters 157
Hive Control Scripts 158
Hive on S3 158
Configuring the S3 Filesystem 158
Adding Data to S3 159
Configuring S3 Authentication 161
Configuring the S3 Endpoint 164
External Table in S3 164
What About Google Cloud Platform and Azure? 165
A Step Toward Transient Clusters 165
A Different Means of Computation 166
12 Streaming in the Cloud with Apache Spark 167
Planning for Spark in the Cloud 167
Installing and Configuring Spark 168
Startup 170
Running Some Test Jobs 170
Configuring Hive on Spark 171
Add Spark Libraries to Hive 171
Configure Hive for Spark 172
Switch YARN to the Fair Scheduler 172
Try Out Hive on Spark on YARN 172
Spark Streaming from AWS Kinesis 174
Creating a Kinesis Stream 174
Populating the Stream with Data 176
Streaming Kinesis Data into Spark 178
What About Google Cloud Platform and Azure? 183
Building Clusters Versus Building Clusters Well 183
Part V Care and Feeding of Hadoop in the Cloud
13 Pricing and Performance 187
Picking Instance Types 187
The Criteria 188
General Cluster Instance Roles 188
Persistent Versus Ephemeral Block Storage 190
Stopping and Starting Entire Clusters 192
Using Temporary Instances 193
Geographic Considerations 195
Regions 195
Availability Zones 195
Performance and Networking 196
14 Network Topologies 197
Public and Private Subnets 197
SSH Tunneling 199
SOCKS Proxy 201
VPN Access 203
Access from Other Subnets 204
Cluster Topologies 204
The Public Cluster 204
The Secured Public Cluster 205
Gateway Instances 207
The Private Cluster 208
Cluster Access to the Internet and Cloud Provider Services 209
Geographic Considerations 211
Regions 211
Availability Zones 211
Starting Topologies 213
Higher-Level Planning 213
15 Patterns for Cluster Usage 215
Long-Running or Transient? 215
Single-User or Multitenant? 218
Self-Service or Managed? 219
Cloud-Only or Hybrid? 220
Watching Cost 221
The Rising Need for Automation 222
16 Using Images for Cluster Management 223
The Structure of an Image 224
EC2 Images 224
GCE Images 225
Azure Images 225
Image Preparation 225
Wait, I'm Using That! 227
Image Creation 228
Image Creation in AWS 228
Image Creation in Google Cloud Platform 229
Image Creation in Azure 231
Image Use 231
Scripting Hadoop Configuration 232
Image Maintenance 232
Image Deletion 233
Image Deletion in AWS 233
Image Deletion in Google Cloud Platform 233
Image Deletion in Azure 234
Automated Image Creation with Packer 234
Automated Cloud Cluster Creation 236
Cloudera Director 236
Hortonworks Data Cloud 236
Qubole Data Service 237
General System Management Tools 237
Images or Tools? 238
More Tooling 238
17 Monitoring and Automation 239
Monitoring Choices 239
Cloud Provider Monitoring Services 240
Rolling Your Own 241
Cloud Provider Command-Line Interfaces 241
AWS CLI 242
Google Cloud Platform CLI 242
Azure CLI 244
Data Formatting for CLI Results 245
What to Monitor 245
Instance Existence 245
Instance Reachability 246
Hadoop Daemon Status 248
System Load 250
Putting Scripting to Use 253
Custom Metrics in CloudWatch 253
Basic Metrics 253
Defining a Custom Metric 254
Feeding Custom Metric Data to CloudWatch 256
Setting an Alarm on a Custom Metric 258
Elastic Compute Using a Custom Metric 260
A Custom Metric for Compute Capacity 260
Prerequisites for Autoscaling Compute 262
Triggering Autoscaling with an Alarm Action 262
What About Shrinking? 263
Other Things to Watch 264
Ingesting Logs into CloudWatch 264
Creating an IAM User for Log Streaming 264
Installing the CloudWatch Agent 266
Creating a Metric Filter 267
Creating an Alarm from a Metric Filter 269
So Much More to See and Do 272
18 Backup and Restoration 273
Patterns to Supplement Backups 273
Backup via Imaging 274
HDFS Replication 275
Cloud Storage Filesystems 276
HDFS Snapshots 277
Hive Metastore Replication 277
Logs 278
A General Cloud Hadoop Backup Strategy 279
Not So Different, But Better 279
To the Cloud 280
A Hadoop Component Start and Stop Scripts 281
B Hadoop Cluster Configuration Scripts 285
C Monitoring Cloud Clusters with Nagios 297
Index 305