Foundations for Architecting Data Solutions: Managing Successful Data Projects

Paperback

$55.99 

Overview

While many companies ponder implementation details such as distributed processing engines and algorithms for data analysis, this practical book takes a much wider view of big data development, starting with initial planning and moving diligently toward execution. Authors Ted Malaska and Jonathan Seidman guide you through the major components necessary to start, architect, and develop successful big data projects.

Everyone from CIOs and COOs to lead architects and developers will explore a variety of big data architectures and applications, from massive data pipelines to web-scale applications. Each chapter addresses a piece of the software development life cycle and identifies patterns to maximize long-term success throughout the life of your project.

  • Start the planning process by considering the key data project types
  • Use guidelines to evaluate and select data management solutions
  • Reduce risk related to technology, your team, and vague requirements
  • Explore system interface design using APIs, REST, and pub/sub systems
  • Choose the right distributed storage system for your big data system
  • Plan and implement metadata collections for your data architecture
  • Use data pipelines to ensure data integrity from source to final storage
  • Evaluate the attributes of various engines for processing the data you collect

Product Details

ISBN-13: 9781492038740
Publisher: O'Reilly Media, Incorporated
Publication date: 09/27/2018
Pages: 187
Product dimensions: 5.40(w) x 9.00(h) x 0.50(d)

About the Author

Ted Malaska is a group technical architect on the Battle.net team at Blizzard, helping support great titles like World of Warcraft, Overwatch, and Hearthstone. Previously, Ted was a principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem, and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has also contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Jonathan Seidman is a software engineer on the Cloud team at Cloudera. Prior to that, he was a solutions architect at Cloudera working with partners to integrate their solutions with Cloudera’s software stack. Previously, he was a technical lead on the big data team at Orbitz Worldwide, helping to manage the Hadoop clusters for one of the most heavily trafficked sites on the internet. He is also a co-founder of the Chicago Hadoop User Group and Chicago Big Data, a coauthor of Hadoop Application Architectures, and technical editor for Hadoop in Practice, and he has spoken at a number of industry conferences on Hadoop and big data.

Table of Contents

Preface

1 Key Data Project Types and Considerations

Major Data Project Types

Data Pipelines and Data Staging

Primary Considerations and Risk Management

Pipeline and Staging Team Makeup

Data Processing and Analysis

Primary Considerations and Risk Management

Data Processing and Analytics Team Makeup

Application Development

Primary Considerations and Risk Management

Application Development Team Makeup

Summary

2 Evaluating and Selecting Data Management Solutions

Stages of Open Source Projects

Private Incubation Stage

Release Stage

"Curing Cancer" Stage

Broken Promises Stage

Hardening Stage

Enterprise Stage

Decline and Slow Death Stage

Common Life Cycles for Open Source Projects

Open Sourcing a Dead Product

The Follower

Evaluating Benchmarks

Considerations for Technology Selection

Understanding the Building Blocks

Looking to a Guide for Advice

Using Analysts

Looking to Market Trends

Summary

3 Managing Risk in Data Projects

Categories of Risk

Technology Risk

Team Risk

Requirements Risk

Managing Risk

Categorizing Risk in Your Architecture

Technology Risk

Strength of the Team

Other Teams

Requirements Risk

Tying This All Together

Using Prototypes and Proofs of Concept

Build Two to Three Ways

Build PoCs and Then Throw Them Away

Deployment Considerations

Using Interfaces

Start Building Early

Test Often and Keep Records

Monitoring and Alerting

Communicating Risk

Collaborate and Gain Buy-In

Share the Risk

Using Risk as a Negotiation Tool

Summary

4 Interface Design

The Human Body

The Human Body Versus a Data Architecture

Decoupling

Decoupling Considerations

Specialization

What Makes a Good Interface Design

The Contract

The Abstraction

Versioning

Being Defensive

Documentation and Naming for Interfaces

Nonfunctional Considerations

Availability

Response-Time Guarantees

Load Capacity

Using Testing to Determine SLAs

Common Interface Examples

Publish-Subscribe

Request-Response Asynchronous Example

Request-Response Synchronous Example

Summary

5 Distributed Storage Systems

Attributes of Distributed Storage Systems

Storage System Genealogy

Partitioning

Mutation Options

Read Paths

Availability Versus Consistency

Primary Use Cases

Storage System Breakdown

HDFS

S3 and Object Stores

Apache HBase

Apache Cassandra

Elasticsearch and Apache Solr

Newcomers: Apache Kudu and CockroachDB

In-Memory Storage Systems

Summary

6 The Meta of Enterprise Data

Reasons to Care About Metadata

Visibility

Relationships

Regulation

Types of Metadata in a Data Architecture

Data at Rest

Data in Motion

Metadata for Source Data

Metadata About Data Processing

Reports and Dashboards

Metadata Collection

Declarative Metadata Collection

Discovery of Metadata

Metadata Management in Practice

Summary

7 Ensuring Data Integrity

Examples of Building Data Pipelines to Ensure Data Integrity

Predefined Data Pipelines

Validation of Data Pipelines

Row Counts

Distinct Count

Full-Byte Comparison

Checksum Comparison

Summary

8 Data Processing

Attributes of Processing Engines

DAG Management

Compute Isolation

Performance

Fault Tolerance

Interaction Model

Batch and/or Streaming

Data Processing over Time

Summary

Index
