Foundations for Architecting Data Solutions: Managing Successful Data Projects
Paperback
Overview
Everyone from CIOs and COOs to lead architects and developers will explore a variety of big data architectures and applications, from massive data pipelines to web-scale applications. Each chapter addresses a piece of the software development life cycle and identifies patterns to maximize long-term success throughout the life of your project.
- Start the planning process by considering the key data project types
- Use guidelines to evaluate and select data management solutions
- Reduce risk related to technology, your team, and vague requirements
- Explore system interface design using APIs, REST, and pub/sub systems
- Choose the right distributed storage system for your big data system
- Plan and implement metadata collections for your data architecture
- Use data pipelines to ensure data integrity from source to final storage
- Evaluate the attributes of various engines for processing the data you collect
Product Details
| ISBN-13: | 9781492038740 |
|---|---|
| Publisher: | O'Reilly Media, Incorporated |
| Publication date: | 09/27/2018 |
| Pages: | 187 |
| Product dimensions: | 5.40(w) x 9.00(h) x 0.50(d) |
About the Author
Jonathan is a software engineer on the Cloud team at Cloudera. Prior to that, he was a solutions architect at Cloudera working with partners to integrate their solutions with Cloudera’s software stack. Previously, he was a technical lead on the big data team at Orbitz Worldwide, helping to manage the Hadoop clusters for one of the most heavily trafficked sites on the internet. He's also a co-founder of the Chicago Hadoop User Group and Chicago Big Data, co-author of Hadoop Application Architectures, and technical editor for Hadoop in Practice, and he has spoken at a number of industry conferences on Hadoop and big data.
Table of Contents
Preface vii
1 Key Data Project Types and Considerations 1
Major Data Project Types 1
Data Pipelines and Data Staging 4
Primary Considerations and Risk Management 4
Pipeline and Staging Team Makeup 15
Data Processing and Analysis 16
Primary Considerations and Risk Management 16
Data Processing and Analytics Team Makeup 20
Application Development 21
Primary Considerations and Risk Management 21
Application Development Team Makeup 26
Summary 26
2 Evaluating and Selecting Data Management Solutions 29
Stages of Open Source Projects 30
Private Incubation Stage 31
Release Stage 32
"Curing Cancer" Stage 32
Broken Promises Stage 33
Hardening Stage 34
Enterprise Stage 35
Decline and Slow Death Stage 36
Common Life Cycles for Open Source Projects 36
Open Sourcing a Dead Product 38
The Follower 39
Evaluating Benchmarks 40
Considerations for Technology Selection 41
Understanding the Building Blocks 42
Looking to a Guide for Advice 43
Using Analysts 43
Looking to Market Trends 44
Summary 45
3 Managing Risk in Data Projects 47
Categories of Risk 47
Technology Risk 47
Team Risk 48
Requirements Risk 48
Managing Risk 48
Categorizing Risk in Your Architecture 48
Technology Risk 51
Strength of the Team 52
Other Teams 54
Requirements Risk 54
Tying This All Together 55
Using Prototypes and Proofs of Concept 58
Build Two to Three Ways 58
Build PoCs and Then Throw Them Away 58
Deployment Considerations 59
Using Interfaces 59
Start Building Early 61
Test Often and Keep Records 61
Monitoring and Alerting 62
Communicating Risk 63
Collaborate and Gain Buy-In 63
Share the Risk 64
Using Risk as a Negotiation Tool 64
Summary 65
4 Interface Design 67
The Human Body 67
The Human Body Versus a Data Architecture 67
Decoupling 71
Decoupling Considerations 74
Specialization 74
What Makes a Good Interface Design 75
The Contract 75
The Abstraction 76
Versioning 76
Being Defensive 77
Documentation and Naming for Interfaces 78
Nonfunctional Considerations 79
Availability 79
Response-Time Guarantees 80
Load Capacity 81
Using Testing to Determine SLAs 81
Common Interface Examples 82
Publish-Subscribe 82
Request-Response Asynchronous Example 84
Request-Response Synchronous Example 86
Summary 86
5 Distributed Storage Systems 89
Attributes of Distributed Storage Systems 89
Storage System Genealogy 90
Partitioning 91
Mutation Options 93
Read Paths 95
Availability Versus Consistency 99
Primary Use Cases 101
Storage System Breakdown 102
HDFS 102
S3 and Object Stores 104
Apache HBase 106
Apache Cassandra 107
Elasticsearch and Apache Solr 111
Newcomers: Apache Kudu and CockroachDB 113
In-Memory Storage Systems 115
Summary 118
6 The Meta of Enterprise Data 119
Reasons to Care About Metadata 120
Visibility 120
Relationships 121
Regulation 123
Types of Metadata in a Data Architecture 124
Data at Rest 124
Data in Motion 126
Metadata for Source Data 130
Metadata About Data Processing 130
Reports and Dashboards 132
Metadata Collection 132
Declarative Metadata Collection 133
Discovery of Metadata 134
Metadata Management in Practice 136
Summary 136
7 Ensuring Data Integrity 139
Examples of Building Data Pipelines to Ensure Data Integrity 140
Predefined Data Pipelines 141
Validation of Data Pipelines 146
Row Counts 146
Distinct Count 147
Full-Byte Comparison 148
Checksum Comparison 149
Summary 150
8 Data Processing 151
Attributes of Processing Engines 151
DAG Management 152
Compute Isolation 155
Performance 157
Fault Tolerance 157
Interaction Model 160
Batch and/or Streaming 161
Data Processing over Time 162
Summary 164
Index 167