
Observability Engineering: Achieving Production Excellence
Paperback
Overview
Authors Charity Majors, Liz Fong-Jones, and George Miranda from Honeycomb explain what constitutes good observability, show you how to improve upon what you're doing today, and provide practical dos and don'ts for migrating from legacy tooling, such as metrics, monitoring, and log management. You'll also learn the impact observability has on organizational culture (and vice versa).
You'll explore:
- How the concept of observability applies to managing software at scale
- The value of practicing observability when delivering complex cloud native applications and systems
- The impact observability has across the entire software development lifecycle
- How and why different functional teams use observability with service-level objectives
- How to instrument your code to help future engineers understand the code you wrote today
- How to produce quality code for context-aware system debugging and maintenance
- How data-rich analytics can help you debug elusive issues
Product Details
| ISBN-13: | 9781492076445 |
|---|---|
| Publisher: | O'Reilly Media, Incorporated |
| Publication date: | 06/14/2022 |
| Pages: | 318 |
| Product dimensions: | 6.93(w) x 9.13(h) x 0.79(d) |
About the Authors
Liz Fong-Jones is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with 15+ years of experience. She is an advocate at Honeycomb.io for the SRE and Observability communities, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.
George Miranda is a former engineer turned product marketer at Honeycomb.io. He spent 15+ years building large-scale distributed systems in the finance and video game industries. He discovered his knack for storytelling and now works to shape the tools, practices, and culture that help improve the lives of people responsible for managing production systems.
Table of Contents
Foreword
Preface
Part I The Path to Observability
1 What Is Observability?
The Mathematical Definition of Observability
Applying Observability to Software Systems
Mischaracterizations About Observability for Software
Why Observability Matters Now
Is This Really the Best Way?
Why Are Metrics and Monitoring Not Enough?
Debugging with Metrics Versus Observability
The Role of Cardinality
The Role of Dimensionality
Debugging with Observability
Observability Is for Modern Systems
Conclusion
2 How Debugging Practices Differ Between Observability and Monitoring
How Monitoring Data Is Used for Debugging
Troubleshooting Behaviors When Using Dashboards
The Limitations of Troubleshooting by Intuition
Traditional Monitoring Is Fundamentally Reactive
How Observability Enables Better Debugging
Conclusion
3 Lessons from Scaling Without Observability
An Introduction to Parse
Scaling at Parse
The Evolution Toward Modern Systems
The Evolution Toward Modern Practices
Shifting Practices at Parse
Conclusion
4 How Observability Relates to DevOps, SRE, and Cloud Native
Cloud Native, DevOps, and SRE in a Nutshell
Observability: Debugging Then Versus Now
Observability Empowers DevOps and SRE Practices
Conclusion
Part II Fundamentals of Observability
5 Structured Events Are the Building Blocks of Observability
Debugging with Structured Events
The Limitations of Metrics as a Building Block
The Limitations of Traditional Logs as a Building Block
Unstructured Logs
Structured Logs
Properties of Events That Are Useful in Debugging
Conclusion
6 Stitching Events into Traces
Distributed Tracing and Why It Matters Now
The Components of Tracing
Instrumenting a Trace the Hard Way
Adding Custom Fields into Trace Spans
Stitching Events into Traces
Conclusion
7 Instrumentation with OpenTelemetry
A Brief Introduction to Instrumentation
Open Instrumentation Standards
Instrumentation Using Code-Based Examples
Start with Automatic Instrumentation
Add Custom Instrumentation
Send Instrumentation Data to a Backend System
Conclusion
8 Analyzing Events to Achieve Observability
Debugging from Known Conditions
Debugging from First Principles
Using the Core Analysis Loop
Automating the Brute-Force Portion of the Core Analysis Loop
The Misleading Promise of AIOps
Conclusion
9 How Observability and Monitoring Come Together
Where Monitoring Fits
Where Observability Fits
System Versus Software Considerations
Assessing Your Organizational Needs
Exceptions: Infrastructure Monitoring That Can't Be Ignored
Real-World Examples
Conclusion
Part III Observability for Teams
10 Applying Observability Practices in Your Team
Join a Community Group
Start with the Biggest Pain Points
Buy Instead of Build
Flesh Out Your Instrumentation Iteratively
Look for Opportunities to Leverage Existing Efforts
Prepare for the Hardest Last Push
Conclusion
11 Observability-Driven Development
Test-Driven Development
Observability in the Development Cycle
Determining Where to Debug
Debugging in the Time of Microservices
How Instrumentation Drives Observability
Shifting Observability Left
Using Observability to Speed Up Software Delivery
Conclusion
12 Using Service-Level Objectives for Reliability
Traditional Monitoring Approaches Create Dangerous Alert Fatigue
Threshold Alerting Is for Known-Unknowns Only
User Experience Is a North Star
What Is a Service-Level Objective?
Reliable Alerting with SLOs
Changing Culture Toward SLO-Based Alerts: A Case Study
Conclusion
13 Acting on and Debugging SLO-Based Alerts
Alerting Before Your Error Budget Is Empty
Framing Time as a Sliding Window
Forecasting to Create a Predictive Burn Alert
The Lookahead Window
The Baseline Window
Acting on SLO Burn Alerts
Using Observability Data for SLOs Versus Time-Series Data
Conclusion
14 Observability and the Software Supply Chain
Why Slack Needed Observability
Instrumentation: Shared Client Libraries and Dimensions
Case Studies: Operationalizing the Supply Chain
Understanding Context Through Tooling
Embedding Actionable Alerting
Understanding What Changed
Conclusion
Part IV Observability at Scale
15 Build Versus Buy and Return on Investment
How to Analyze the ROI of Observability
The Real Costs of Building Your Own
The Hidden Costs of Using "Free" Software
The Benefits of Building Your Own
The Risks of Building Your Own
The Real Costs of Buying Software
The Hidden Financial Costs of Commercial Software
The Hidden Nonfinancial Costs of Commercial Software
The Benefits of Buying Commercial Software
The Risks of Buying Commercial Software
Buy Versus Build Is Not a Binary Choice
Conclusion
16 Efficient Data Storage
The Functional Requirements for Observability
Time-Series Databases Are Inadequate for Observability
Other Possible Data Stores
Data Storage Strategies
Case Study: The Implementation of Honeycomb's Retriever
Partitioning Data by Time
Storing Data by Column Within Segments
Performing Query Workloads
Querying for Traces
Querying Data in Real Time
Making It Affordable with Tiering
Making It Fast with Parallelism
Dealing with High Cardinality
Scaling and Durability Strategies
Notes on Building Your Own Efficient Data Store
Conclusion
17 Cheap and Accurate Enough: Sampling
Sampling to Refine Your Data Collection
Using Different Approaches to Sampling
Constant-Probability Sampling
Sampling on Recent Traffic Volume
Sampling Based on Event Content (Keys)
Combining per Key and Historical Methods
Choosing Dynamic Sampling Options
When to Make a Sampling Decision for Traces
Translating Sampling Strategies into Code
The Base Case
Fixed-Rate Sampling
Recording the Sample Rate
Consistent Sampling
Target Rate Sampling
Having More Than One Static Sample Rate
Sampling by Key and Target Rate
Sampling with Dynamic Rates on Arbitrarily Many Keys
Putting It All Together: Head and Tail per Key Target Rate Sampling
Conclusion
18 Telemetry Management with Pipelines
Attributes of Telemetry Pipelines
Routing
Security and Compliance
Workload Isolation
Data Buffering
Capacity Management
Data Filtering and Augmentation
Data Transformation
Ensuring Data Quality and Consistency
Managing a Telemetry Pipeline: Anatomy
Challenges When Managing a Telemetry Pipeline
Performance
Correctness
Availability
Reliability
Isolation
Data Freshness
Use Case: Telemetry Management at Slack
Metrics Aggregation
Logs and Trace Events
Open Source Alternatives
Managing a Telemetry Pipeline: Build Versus Buy
Conclusion
Part V Spreading Observability Culture
19 The Business Case for Observability
The Reactive Approach to Introducing Change
The Return on Investment of Observability
The Proactive Approach to Introducing Change
Introducing Observability as a Practice
Using the Appropriate Tools
Instrumentation
Data Storage and Analytics
Rolling Out Tools to Your Teams
Knowing When You Have Enough Observability
Conclusion
20 Observability's Stakeholders and Allies
Recognizing Nonengineering Observability Needs
Creating Observability Allies in Practice
Customer Support Teams
Customer Success and Product Teams
Sales and Executive Teams
Using Observability Versus Business Intelligence Tools
Query Execution Time
Accuracy
Recency
Structure
Time Windows
Ephemerality
Using Observability and BI Tools Together in Practice
Conclusion
21 An Observability Maturity Model
A Note About Maturity Models
Why Observability Needs a Maturity Model
About the Observability Maturity Model
Capabilities Referenced in the OMM
Respond to System Failure with Resilience
Deliver High-Quality Code
Manage Complexity and Technical Debt
Release on a Predictable Cadence
Understand User Behavior
Using the OMM for Your Organization
Conclusion
22 Where to Go from Here
Observability, Then Versus Now
Additional Resources
Predictions for Where Observability Is Going
Index