Table of Contents
Foreword xi
Preface xiii
1 Introducing Database Reliability Engineering 1
Guiding Principles of the DBRE 2
Protect the Data 2
Self-Service for Scale 3
Elimination of Toil 4
Databases Are Not Special Snowflakes 5
Eliminate the Barriers Between Software and Operations 5
Operations Core Overview 6
Hierarchy of Needs 7
Survival and Safety 7
Love and Belonging 8
Esteem 9
Self-actualization 10
Wrapping Up 11
2 Service-Level Management 13
Why Do I Need Service-Level Objectives? 13
Service-Level Indicators 15
Latency 15
Availability 16
Throughput 16
Durability 16
Cost or Efficiency 16
Defining Service Objectives 17
Latency Indicators 17
Availability Indicators 20
Throughput Indicators 23
Monitoring and Reporting on SLOs 25
Monitoring Availability 25
Monitoring Latency 28
Monitoring Throughput 28
Monitoring Cost and Efficiency 28
Wrapping Up 29
3 Risk Management 31
Risk Considerations 32
Unknown Factors and Complexity 32
Availability of Resources 33
Human Factors 33
Group Factors 34
What Do We Do? 35
What Not to Do 35
A Working Process: Bootstrapping 37
Service Risk Evaluation 38
Architectural Inventory 40
Prioritization 41
Control and Decision Making 43
Ongoing Iterations 46
Wrapping Up 48
4 Operational Visibility 49
The New Rules of Operational Visibility 51
Treat Op Viz Systems Like BI Systems 52
Distributed Ephemeral Environments Trending to the Norm 52
Store at High Resolutions for Key Metrics 54
Keep Your Architecture Simple 55
An Op Viz Framework 56
Data In 57
Telemetry/Metrics 59
Events 60
Logs 60
Data Out 60
Bootstrapping Your Monitoring 61
Is the Data Safe? 63
Is the Service Up? 64
Are the Consumers in Pain? 65
Instrumenting the Application 66
Distributed Tracing 66
Events and Logs 68
Instrumenting the Server or Instance 68
Events and Logs 70
Instrumenting the Datastore 71
Datastore Connection Layer 71
Utilization 71
Saturation 72
Errors 73
Internal Database Visibility 74
Throughput and Latency Metrics 74
Commits, Redo, and Journaling 75
Replication State 75
Memory Structures 76
Locking and Concurrency 77
Database Objects 78
Database Queries 79
Database Asserts and Events 79
Wrapping Up 80
5 Infrastructure Engineering 81
Hosts 81
Physical Servers 81
Operating a System and Kernel 82
Storage Area Networks 92
Benefits of Physical Servers 92
Cons of Physical Servers 92
Virtualization 93
Hypervisor 93
Concurrency 94
Storage 94
Use Cases 94
Containers 95
Database as a Service 95
Challenges of DBaaS 96
The DBRE and the DBaaS 96
Wrapping Up 97
6 Infrastructure Management 99
Version Control 100
Configuration Definition 101
Building from Configuration 103
Maintaining Configuration 104
Enforcement of Configuration Definitions 105
Infrastructure Definition and Orchestration 105
Monolithic Infrastructure Definitions 106
Separating Vertically 107
Separated Tiers (Horizontal Definitions) 108
Acceptance Testing and Compliance 109
Service Catalog 109
Bringing It All Together 110
Development Environments 111
Wrapping Up 112
7 Backup and Recovery 113
Core Concepts 114
Physical versus Logical 114
Online versus Offline 114
Full, Incremental, and Differential 115
Considerations for Recovery 115
Recovery Scenarios 116
Planned Recovery Scenarios 116
Unplanned Scenarios 118
Scenario Scope 121
Scenario Impact 121
Anatomy of a Recovery Strategy 122
Building Block 1: Detection 122
Building Block 2: Tiered Storage 124
Building Block 3: A Varied Toolbox 125
Building Block 4: Testing 127
A Recovery Strategy Defined 128
Online, Fast Storage with Full and Incremental Backups 128
Online, Slow Storage with Full and Incremental Backups 129
Offline Storage 130
Object Storage 131
Wrapping Up 132
8 Release Management 133
Education and Collaboration 133
Become a Funnel 134
Foster Conversations 134
Domain-Specific Knowledge 135
Collaboration 137
Integration 138
Prerequisites 139
Testing 141
Test-Friendly Development Practices 142
Post-Commit Testing 143
Full Dataset Testing 144
Downstream Tests 145
Operational Tests 145
Deployment 146
Migrations and Versioning 146
Impact Analysis 147
Migration Patterns 148
Manual or Automated 151
Wrapping Up 151
9 Security 153
The Purpose of Security 153
Protecting Data from Theft 154
Protecting from Purposeful Damage 154
Protecting from Accidental Damage 154
Protecting Data from Exposure 155
Compliance and Auditing Standards 155
Database Security as a Function 155
Education and Collaboration 155
Self-Service 156
Integration and Testing 157
Operational Visibility 158
Vulnerabilities and Exploits 160
STRIDE 160
DREAD 161
Basic Precautions 162
Denial of Service 163
SQL Injection 166
Network and Authentication Protocols 168
Encryption of Data 168
Financial Data 169
Personal Health Data 169
Private Individual Data 170
Military or Government Data 170
Confidential/Sensitive Business Data 170
Data in Transit 171
Data in the Database 174
Data in the Filesystem 177
Wrapping Up 179
10 Data Storage, Indexing, and Replication 181
Data Structure Storage 181
Database Row Storage 182
Sorted-String Tables and Log-Structured Merge Trees 185
Indexing 188
Logs and Databases 189
Data Replication 189
Single-Leader 190
Multi-Leader Replication 203
Wrapping Up 209
11 Datastore Field Guide 211
Conceptual Attributes of a Datastore 212
The Data Model 212
Transactions 216
BASE 221
Internal Attributes of a Datastore 222
Storage 222
The Ubiquitous CAP Theorem Section 223
Consistency Latency Trade-offs 225
Availability 226
Wrapping Up 228
12 A Data Architecture Sampler 229
Architectural Components 229
Frontend Datastores 229
Data Access Layer 230
Database Proxies 231
Event and Message Systems 233
Caches and Memory Stores 235
Data Architectures 238
Lambda and Kappa 238
Event Sourcing 241
CQRS 242
Wrapping Up 243
13 Making the Case for DBRE 245
A Culture of Database Reliability 246
Breaking Down Barriers 246
Data-Driven Decision Making 251
Data Integrity and Recoverability 252
Wrapping Up 252
Index 253