Hadoop For Dummies
By Dirk deRoos
John Wiley & Sons
Copyright © 2014 John Wiley & Sons, Ltd
All rights reserved.
Introducing Hadoop and Seeing What It's Good For
In This Chapter
* Seeing how Hadoop fills a need
* Digging (a bit) into Hadoop's history
* Getting Hadoop for yourself
* Looking at Hadoop application offerings
Organizations are flooded with data. Not only that, but in an era of incredibly cheap storage where everyone and everything are interconnected, the nature of the data we're collecting is also changing. For many businesses, their critical data used to be limited to their transactional databases and data warehouses. In these kinds of systems, data was organized into orderly rows and columns, where every byte of information was well understood in terms of its nature and its business value. These databases and warehouses are still extremely important, but businesses are now differentiating themselves by how they're finding value in the large volumes of data that are not stored in a tidy database.
The variety of data that's available now to organizations is incredible: Internally, you have website clickstream data, typed notes from call center operators, e-mail and instant messaging repositories; externally, open data initiatives from public and private entities have made massive troves of raw data available for analysis. The challenge here is that traditional tools are poorly equipped to deal with the scale and complexity of much of this data. That's where Hadoop comes in. It's tailor-made to deal with all sorts of messiness. CIOs everywhere have taken notice, and Hadoop is rapidly becoming an established platform in any serious IT department.
This chapter is a newcomer's welcome to the wonderful world of Hadoop — its design, capabilities, and uses. If you're new to big data, you'll also find important background information that applies to Hadoop and other solutions.
Big Data and the Need for Hadoop
As with many buzzwords, what people mean when they say "big data" is not always clear. This lack of clarity is made worse by IT people trying to attract attention to their own projects by labeling them as "big data," even though there's nothing big about them.
At its core, big data is simply a way of describing data problems that are unsolvable using traditional tools. To help understand the nature of big data problems, we like "the three Vs of big data," a widely accepted characterization of the factors that make a data challenge "big":
[check] Volume: High volumes of data, ranging from dozens of terabytes up to petabytes.
[check] Variety: Data that's organized in multiple structures, ranging from raw text (which, from a computer's perspective, has little or no discernible structure — many people call this unstructured data) to log files (commonly referred to as being semistructured) to data ordered in strongly typed rows and columns (structured data). To make things even more confusing, some data sets even include portions of all three kinds of data. (This is known as multistructured data.)
[check] Velocity: Data that enters your organization and has some kind of value for a limited window of time — a window that usually shuts well before the data has been transformed and loaded into a data warehouse for deeper analysis (for example, financial securities ticker data, which may reveal a buying opportunity, but only for a short while). The higher the volumes of data entering your organization per second, the bigger your velocity challenge.
Each of these criteria clearly poses its own, distinct challenge to someone wanting to analyze the information. As such, these three criteria are an easy way to assess big data problems and provide clarity to what has become a vague buzzword. The commonly held rule of thumb is that if your data storage and analysis work exhibits any of these three characteristics, chances are that you've got yourself a big data challenge.
As you'll see in this book, Hadoop is anything but a traditional information technology tool, and it is well suited to meet many big data challenges, especially (as you'll soon see) those involving high volumes of data and data with a variety of structures. But there are also big data challenges for which Hadoop isn't well suited, in particular analyzing high-velocity data the instant it enters an organization. Data velocity challenges involve the analysis of data while it's in motion, whereas Hadoop is tailored to analyze data when it's at rest. The lesson to draw from this is that although Hadoop is an important tool for big data analysis, it will by no means solve all your big data problems. Despite what some of the buzz and hype suggests, the entire big data domain isn't synonymous with Hadoop.
Exploding data volumes
It is by now obvious that we live in an advanced state of the information age. Data is being generated and captured electronically by networked sensors in tremendous volumes, at ever-increasing velocities, and in mind-boggling varieties. Devices such as mobile telephones, cameras, automobiles, televisions, and machines in industry and health care all contribute to the exploding data volumes that we see today. This data can be browsed, stored, and shared, but its greatest value remains largely untapped. That value lies in its potential to provide insight that can solve vexing business problems, open new markets, reduce costs, and improve the overall health of our societies.
In the early 2000s (we like to say "the oughties"), companies such as Yahoo! and Google were looking for a new approach to analyzing the huge amounts of data that their search engines were collecting. Hadoop is the result of that effort, representing an efficient and cost-effective way of reducing huge analytical challenges to small, manageable tasks.
Varying data structures
Structured data is characterized by a high degree of organization and is typically the kind of data you see in relational databases or spreadsheets. Because of its defined structure, it maps easily to one of the standard data types (or user-defined types that are based on those standard types). It can be searched using standard search algorithms and manipulated in well-defined ways.
Semistructured data (such as what you might see in log files) is a bit more difficult to understand than structured data. Normally, this kind of data is stored in the form of text files, where there is some degree of order — for example, tab-delimited files, where columns are separated by a tab character. So instead of being able to issue a database query for a certain column and knowing exactly what you're getting back, users typically need to explicitly assign data types to any data elements extracted from semistructured data sets.
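To make the idea of explicitly assigning data types concrete, here is a minimal Python sketch. The log layout (timestamp, HTTP status, bytes sent, URL path) and the sample lines are our own assumptions for illustration; they don't come from any particular system.

```python
from datetime import datetime

# Hypothetical tab-delimited log lines (an assumed layout):
# timestamp, HTTP status, bytes sent, URL path.
RAW_LOG = (
    "2014-03-01T12:00:05\t200\t5120\t/index.html\n"
    "2014-03-01T12:00:09\t404\t312\t/missing.html\n"
)


def parse_line(line):
    """Split one tab-delimited line and explicitly assign a type to each field."""
    timestamp, status, size, path = line.rstrip("\n").split("\t")
    return {
        "timestamp": datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S"),
        "status": int(status),   # the text "200" becomes the integer 200
        "bytes": int(size),
        "path": path,            # stays a string
    }


records = [parse_line(line) for line in RAW_LOG.splitlines()]
total_bytes = sum(record["bytes"] for record in records)
print(total_bytes)  # 5432
```

Nothing in the file itself says that the second column is an integer; that knowledge lives in the parsing code, which is exactly the extra work that semistructured data demands.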
Unstructured data has none of the advantages of having structure coded into a data set. (To be fair, the unstructured label is a bit strong; all data stored in a computer has some degree of structure. When it comes to so-called unstructured data, there's simply too little structure to make much sense of it.) Its analysis by way of more traditional approaches is difficult and costly at best, and logistically impossible at worst. Just imagine having many years' worth of notes typed by call center operators that describe customer observations. Without a robust set of text analytics tools, it would be extremely tedious to determine any interesting behavior patterns. Moreover, the sheer volume of data in many cases poses virtually insurmountable challenges to traditional data mining techniques, which, even when conditions are good, can handle only a fraction of the valuable data that's available.
A playground for data scientists
A data scientist is a computer scientist who loves data (lots of data) and the sublime challenge of figuring out ways to squeeze every drop of value out of that abundant data. A data playground is an enterprise store of many terabytes (or even petabytes) of data that data scientists can use to develop, test, and enhance their analytical "toys."
Now that you know what big data is and why it's important, it's time to introduce Hadoop, the granddaddy of these nontraditional analytical toys. Understanding how this amazing platform for the analysis of big data came to be, and acquiring some basic principles about how it works, will help you to master the details we provide in the remainder of this book.
The Origin and Design of Hadoop
So what exactly is this thing with the funny name — Hadoop? At its core, Hadoop is a framework for storing data on large clusters of commodity hardware — everyday computer hardware that is affordable and easily available — and running applications against that data. A cluster is a group of interconnected computers (known as nodes) that can work together on the same problem. Using networks of affordable compute resources to acquire business insight is the key value proposition of Hadoop.
As for that name, Hadoop, don't look for any major significance there; it's simply the name that Doug Cutting's son gave to his stuffed elephant. (Doug Cutting is, of course, the co-creator of Hadoop.) The name is unique and easy to remember — characteristics that made it a great choice.
Hadoop consists of two main components: a distributed processing framework named MapReduce (which is now supported by a component called YARN, which we describe a little later) and a distributed file system known as the Hadoop distributed file system, or HDFS.
An application that is running on Hadoop gets its work divided among the nodes (machines) in the cluster, and HDFS stores the data that will be processed. A Hadoop cluster can span thousands of machines, where HDFS stores data, and MapReduce jobs do their processing near the data, which keeps I/O costs low. MapReduce is extremely flexible, and enables the development of a wide variety of applications.
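The idea of processing "near the data" can be illustrated with a toy scheduling rule. The sketch below is not Hadoop's actual scheduler; the function name pick_node and the node names are hypothetical, and the rule is simply "prefer a free node that already stores the data."

```python
def pick_node(block_locations, free_nodes):
    """Choose a node for a task, preferring one that already stores the data.

    block_locations: set of node names holding a copy of the input data
    free_nodes: list of node names that can currently accept a task
    Returns (node_name, is_local).
    """
    if not free_nodes:
        raise ValueError("no free nodes available")
    for node in free_nodes:
        if node in block_locations:
            return node, True      # data is already on this node: no network copy
    return free_nodes[0], False    # fall back: data must travel over the network


choice = pick_node({"node2", "node5"}, ["node1", "node2", "node3"])
print(choice)  # ('node2', True)
```

When the data is already on the chosen node, the task reads from local disk instead of pulling its input across the network, which is where the I/O savings come from.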
As you might have surmised, a Hadoop cluster is a form of compute cluster, a type of cluster that's used mainly for computational purposes. In a compute cluster, many computers (compute nodes) can share computational workloads and take advantage of a very large aggregate bandwidth across the cluster. Hadoop clusters typically consist of a few master nodes, which control the storage and processing systems in Hadoop, and many slave nodes, which store all the cluster's data and are also where the data gets processed.
Distributed processing with MapReduce
MapReduce involves the processing of a sequence of operations on distributed data sets. The data consists of key-value pairs, and the computations have only two phases: a map phase and a reduce phase. User-defined MapReduce jobs run on the compute nodes in the cluster.
Generally speaking, a MapReduce job runs as follows:
1. During the Map phase, input data is split into a large number of fragments, each of which is assigned to a map task.
2. These map tasks are distributed across the cluster.
3. Each map task processes the key-value pairs from its assigned fragment and produces a set of intermediate key-value pairs.
4. The intermediate data set is sorted by key, and the sorted data is partitioned into a number of fragments that matches the number of reduce tasks.
5. During the Reduce phase, each reduce task processes the data fragment that was assigned to it and produces output key-value pairs.
6. These reduce tasks are also distributed across the cluster and write their output to HDFS when finished.
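The steps above can be sketched on a single machine with the classic word-count example. This is a plain-Python simulation, not the Hadoop API: the function names are our own, the partitioning rule is a toy stand-in chosen so that results are repeatable, and the output goes into a dictionary rather than HDFS.

```python
from collections import defaultdict


def map_task(fragment):
    """Map: emit an intermediate (word, 1) pair for every word in the fragment."""
    return [(word.lower(), 1) for line in fragment for word in line.split()]


def partition(pairs, num_reducers):
    """Sort intermediate pairs by key and split them into one fragment per reduce task."""
    fragments = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in sorted(pairs):
        # Toy partitioning rule (an assumption): character codes modulo reducer count.
        index = sum(ord(ch) for ch in key) % num_reducers
        fragments[index][key].append(value)
    return fragments


def reduce_task(fragment):
    """Reduce: sum the values collected for each key in this fragment."""
    return {key: sum(values) for key, values in fragment.items()}


def run_job(lines, num_map_tasks=2, num_reducers=2):
    # Step 1: split the input into fragments, one per map task.
    splits = [lines[i::num_map_tasks] for i in range(num_map_tasks)]
    # Steps 2-3: run each map task (sequentially here; in parallel on a real cluster).
    intermediate = [pair for split in splits for pair in map_task(split)]
    # Step 4: sort by key and partition for the reduce tasks.
    fragments = partition(intermediate, num_reducers)
    # Steps 5-6: run each reduce task and collect its output.
    output = {}
    for fragment in fragments:
        output.update(reduce_task(fragment))
    return output


counts = run_job(["big data is big", "Hadoop stores big data"])
print(counts["big"])  # 3
```

Because every occurrence of a given key lands in the same reduce fragment, each reduce task can finish its work without talking to the others, which is what lets the real framework spread the tasks across a cluster.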
The Hadoop MapReduce framework in earlier (pre-version 2) Hadoop releases has a single master service called a JobTracker and several slave services called TaskTrackers, one per node in the cluster. When you submit a MapReduce job to the JobTracker, the job is placed into a queue and then runs according to the scheduling rules defined by an administrator. As you might expect, the JobTracker manages the assignment of map-and-reduce tasks to the TaskTrackers.
With Hadoop 2, a new resource management system called YARN (short for Yet Another Resource Negotiator) is in place. YARN provides generic scheduling and resource management services so that you can run more than just MapReduce applications on your Hadoop cluster. By contrast, the JobTracker/TaskTracker architecture could run only MapReduce.
We describe YARN and the JobTracker/TaskTracker architectures in Chapter 7.
HDFS also has a master/slave architecture:
[check] Master service: Called a NameNode, it controls access to data files.
[check] Slave services: Called DataNodes, they're distributed one per node in the cluster. DataNodes manage the storage that's associated with the nodes on which they run, serving client read and write requests, among other tasks.
For more information on HDFS, see Chapter 4.
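The division of labor between the two HDFS services can be pictured with a toy model. The classes and helper functions below are hypothetical and greatly simplified: the tiny block size and round-robin placement are our own assumptions, and real HDFS features such as block replication are left out entirely.

```python
class DataNode:
    """Toy DataNode: stores block contents for the node it runs on."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Toy NameNode: holds only metadata (which blocks make up a file, and where)."""

    def __init__(self, datanode_names):
        self.datanode_names = datanode_names
        self.file_blocks = {}  # path -> list of (block_id, datanode name)

    def allocate(self, path, num_blocks):
        # Round-robin placement is a simplification made for this sketch.
        placements = [
            (f"{path}#blk{i}", self.datanode_names[i % len(self.datanode_names)])
            for i in range(num_blocks)
        ]
        self.file_blocks[path] = placements
        return placements

    def locate(self, path):
        return self.file_blocks[path]


def put(namenode, nodes, path, data, block_size=4):
    """Client write: ask the NameNode where blocks go, then send data to DataNodes."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for (block_id, node_name), chunk in zip(namenode.allocate(path, len(chunks)), chunks):
        nodes[node_name].write_block(block_id, chunk)


def get(namenode, nodes, path):
    """Client read: look up block locations, then fetch each block from its DataNode."""
    return "".join(nodes[name].read_block(block_id)
                   for block_id, name in namenode.locate(path))


nodes = {name: DataNode(name) for name in ("node1", "node2", "node3")}
namenode = NameNode(list(nodes))
put(namenode, nodes, "/logs/a.txt", "hello hadoop")
print(get(namenode, nodes, "/logs/a.txt"))  # hello hadoop
```

Notice that the file's contents never pass through the toy NameNode. It answers only the question "where is the data?", while the DataNodes do the actual storing and serving.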
Apache Hadoop ecosystem
This section introduces other open source components that are typically seen in a Hadoop deployment. Hadoop is more than MapReduce and HDFS: It's also a family of related projects (an ecosystem, really) for distributed computing and large-scale data processing. Most (but not all) of these projects are hosted by the Apache Software Foundation. Table 1-1 lists some of these projects.
The Hadoop ecosystem and its commercial distributions (see the "Comparing distributions" section, later in this chapter) continue to evolve, with new or improved technologies and tools emerging all the time.
Figure 1-1 shows the various Hadoop ecosystem projects and how they relate to one another:
[FIGURE 1-1 OMITTED]
Examining the Various Hadoop Offerings
Hadoop is available from either the Apache Software Foundation or from companies that offer their own Hadoop distributions.
Only products that are available directly from the Apache Software Foundation can be called Hadoop releases. Products from other companies can include the official Apache Hadoop release files, but products that are "forked" from (and represent modified or extended versions of) the Apache Hadoop source tree are not supported by the Apache Software Foundation.
Apache Hadoop has two important release series:
[check] 1.x: At the time of writing, this release is the most stable version of Hadoop available (1.2.1).
Even after the 2.x release branch became available, the 1.x series is still commonly found in production systems. All major Hadoop distributions include solutions for providing high availability for the NameNode service, a capability that first appears in the 2.x release branch of Hadoop.
[check] 2.x: At the time of writing, this is the current version of Apache Hadoop (2.2.0), including these features:
A MapReduce architecture, named MapReduce 2 or YARN (Yet Another Resource Negotiator): It divides the two major functions of the JobTracker (resource management and job life-cycle management) into separate components.
HDFS availability and scalability: The major limitation in Hadoop 1 was that the NameNode was a single point of failure. Hadoop 2 provides the ability for the NameNode service to fail over to an active standby NameNode. The NameNode is also enhanced to scale out to support clusters with very large numbers of files. In Hadoop 1, clusters typically could not expand beyond roughly 5,000 nodes. By adding multiple active NameNode services, with each one responsible for managing specific partitions of data, you can scale out to a much greater degree.
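One way to picture several active NameNode services, each managing its own partition of the data, is a lookup table keyed by path prefix. The sketch below is only an illustration: the prefixes, the NameNode names, and the function namenode_for are all hypothetical and are not Hadoop configuration or API.

```python
# Hypothetical mapping of data partitions (path prefixes) to NameNode services.
PARTITIONS = {
    "/user": "namenode-a",
    "/logs": "namenode-b",
    "/warehouse": "namenode-c",
}


def namenode_for(path):
    """Return the NameNode responsible for the partition that contains `path`."""
    matches = [prefix for prefix in PARTITIONS
               if path == prefix or path.startswith(prefix + "/")]
    if not matches:
        raise KeyError(f"no NameNode manages {path}")
    # If several prefixes match, the longest (most specific) one wins.
    return PARTITIONS[max(matches, key=len)]


print(namenode_for("/logs/2014/03/01.log"))  # namenode-b
```

Because each NameNode tracks only the files under its own prefixes, adding another NameNode adds capacity for more files without enlarging any single service's workload.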
Some descriptions around the versioning of Hadoop are confusing because both Hadoop 1.x and 2.x are at times referenced using different version numbers: Hadoop 1.0 is occasionally known as Hadoop 0.20.205, while Hadoop 2.x is sometimes referred to as Hadoop 0.23. As of December 2011, the Apache Hadoop project was deemed to be production-ready by the open source community, and the Hadoop 0.20.205 version number was officially changed to 1.0.0. Since then, legacy version numbering (below version 1.0) has persisted, partially because work on Hadoop 2.x was started well before the version numbering jump to 1.0 was made, and the Hadoop 0.23 branch was already created. Now that Hadoop 2.2.0 is production-ready, we're seeing the old numbering less and less, but it still surfaces every now and then.
Excerpted from Hadoop For Dummies by Dirk deRoos. Copyright © 2014 John Wiley & Sons, Ltd. Excerpted by permission of John Wiley & Sons.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.