Observability Engineering: Achieving Production Excellence





Observability is critical for building, changing, and understanding the software that powers complex modern systems. Teams that adopt observability are much better equipped to ship code swiftly and confidently, identify outliers and aberrant behaviors, and understand the experience of each and every user. This practical book explains the value of observable systems and shows you how to practice observability-driven development.

Authors Charity Majors, Liz Fong-Jones, and George Miranda from Honeycomb explain what constitutes good observability, show you how to improve upon what you're doing today, and provide practical dos and don'ts for migrating from legacy tooling, such as metrics, monitoring, and log management. You'll also learn the impact observability has on organizational culture (and vice versa).

You'll explore:

  • How the concept of observability applies to managing software at scale
  • The value of practicing observability when delivering complex cloud native applications and systems
  • The impact observability has across the entire software development lifecycle
  • How and why different functional teams use observability with service-level objectives
  • How to instrument your code to help future engineers understand the code you wrote today
  • How to produce quality code for context-aware system debugging and maintenance
  • How data-rich analytics can help you debug elusive issues

Product Details

ISBN-13: 9781492076445
Publisher: O'Reilly Media, Incorporated
Publication date: 06/14/2022
Pages: 318
Sales rank: 693,452
Product dimensions: 7.00(w) x 9.19(h)

About the Author

Charity Majors is a cofounder and engineer at Honeycomb.io, a startup that blends the speed of time series with the raw power of rich events to give you interactive, iterative debugging for complex systems. She has worked at companies like Facebook, Parse, and Linden Lab, as a systems engineer and engineering manager, but always seems to end up responsible for the databases too.

Liz Fong-Jones is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with 15+ years of experience. She is an advocate at Honeycomb.io for the SRE and Observability communities, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.

George Miranda is a former engineer turned product marketer at Honeycomb.io. He spent 15+ years building large-scale distributed systems in the finance and video game industries. He discovered his knack for storytelling and now works to shape the tools, practices, and culture that help improve the lives of people responsible for managing production systems.

Table of Contents

Foreword xi

Preface xv

Part I The Path to Observability

1 What Is Observability? 3

The Mathematical Definition of Observability 4

Applying Observability to Software Systems 4

Mischaracterizations About Observability for Software 7

Why Observability Matters Now 8

Is This Really the Best Way? 9

Why Are Metrics and Monitoring Not Enough? 9

Debugging with Metrics Versus Observability 11

The Role of Cardinality 13

The Role of Dimensionality 14

Debugging with Observability 16

Observability Is for Modern Systems 17

Conclusion 17

2 How Debugging Practices Differ Between Observability and Monitoring 19

How Monitoring Data Is Used for Debugging 19

Troubleshooting Behaviors When Using Dashboards 21

The Limitations of Troubleshooting by Intuition 23

Traditional Monitoring Is Fundamentally Reactive 24

How Observability Enables Better Debugging 26

Conclusion 28

3 Lessons from Scaling Without Observability 29

An Introduction to Parse 29

Scaling at Parse 31

The Evolution Toward Modern Systems 33

The Evolution Toward Modern Practices 36

Shifting Practices at Parse 38

Conclusion 41

4 How Observability Relates to DevOps, SRE, and Cloud Native 43

Cloud Native, DevOps, and SRE in a Nutshell 43

Observability: Debugging Then Versus Now 45

Observability Empowers DevOps and SRE Practices 46

Conclusion 48

Part II Fundamentals of Observability

5 Structured Events Are the Building Blocks of Observability 51

Debugging with Structured Events 52

The Limitations of Metrics as a Building Block 53

The Limitations of Traditional Logs as a Building Block 55

Unstructured Logs 55

Structured Logs 56

Properties of Events That Are Useful in Debugging 57

Conclusion 59

6 Stitching Events into Traces 61

Distributed Tracing and Why It Matters Now 61

The Components of Tracing 63

Instrumenting a Trace the Hard Way 65

Adding Custom Fields into Trace Spans 68

Stitching Events into Traces 70

Conclusion 71

7 Instrumentation with OpenTelemetry 73

A Brief Introduction to Instrumentation 74

Open Instrumentation Standards 74

Instrumentation Using Code-Based Examples 75

Start with Automatic Instrumentation 76

Add Custom Instrumentation 78

Send Instrumentation Data to a Backend System 80

Conclusion 82

8 Analyzing Events to Achieve Observability 83

Debugging from Known Conditions 84

Debugging from First Principles 85

Using the Core Analysis Loop 86

Automating the Brute-Force Portion of the Core Analysis Loop 88

This Misleading Promise of AIOps 91

Conclusion 92

9 How Observability and Monitoring Come Together 95

Where Monitoring Fits 96

Where Observability Fits 97

System Versus Software Considerations 97

Assessing Your Organizational Needs 99

Exceptions: Infrastructure Monitoring That Can't Be Ignored 101

Real-World Examples 101

Conclusion 103

Part III Observability for Teams

10 Applying Observability Practices in Your Team 107

Join a Community Group 107

Start with the Biggest Pain Points 109

Buy Instead of Build 109

Flesh Out Your Instrumentation Iteratively 111

Look for Opportunities to Leverage Existing Efforts 112

Prepare for the Hardest Last Push 114

Conclusion 115

11 Observability-Driven Development 117

Test-Driven Development 117

Observability in the Development Cycle 118

Determining Where to Debug 119

Debugging in the Time of Microservices 120

How Instrumentation Drives Observability 121

Shifting Observability Left 123

Using Observability to Speed Up Software Delivery 123

Conclusion 125

12 Using Service-Level Objectives for Reliability 127

Traditional Monitoring Approaches Create Dangerous Alert Fatigue 127

Threshold Alerting Is for Known-Unknowns Only 129

User Experience Is a North Star 131

What Is a Service-Level Objective? 132

Reliable Alerting with SLOs 133

Changing Culture Toward SLO-Based Alerts: A Case Study 135

Conclusion 138

13 Acting on and Debugging SLO-Based Alerts 139

Alerting Before Your Error Budget Is Empty 139

Framing Time as a Sliding Window 141

Forecasting to Create a Predictive Burn Alert 142

The Lookahead Window 144

The Baseline Window 151

Acting on SLO Burn Alerts 152

Using Observability Data for SLOs Versus Time-Series Data 154

Conclusion 156

14 Observability and the Software Supply Chain 157

Why Slack Needed Observability 159

Instrumentation: Shared Client Libraries and Dimensions 161

Case Studies: Operationalizing the Supply Chain 164

Understanding Context Through Tooling 164

Embedding Actionable Alerting 166

Understanding What Changed 168

Conclusion 170

Part IV Observability at Scale

15 Build Versus Buy and Return on Investment 173

How to Analyze the ROI of Observability 174

The Real Costs of Building Your Own 175

The Hidden Costs of Using "Free" Software 175

The Benefits of Building Your Own 176

The Risks of Building Your Own 177

The Real Costs of Buying Software 179

The Hidden Financial Costs of Commercial Software 179

The Hidden Nonfinancial Costs of Commercial Software 180

The Benefits of Buying Commercial Software 181

The Risks of Buying Commercial Software 182

Buy Versus Build Is Not a Binary Choice 182

Conclusion 183

16 Efficient Data Storage 185

The Functional Requirements for Observability 185

Time-Series Databases Are Inadequate for Observability 187

Other Possible Data Stores 189

Data Storage Strategies 190

Case Study: The Implementation of Honeycomb's Retriever 193

Partitioning Data by Time 194

Storing Data by Column Within Segments 195

Performing Query Workloads 197

Querying for Traces 199

Querying Data in Real Time 200

Making It Affordable with Tiering 200

Making It Fast with Parallelism 201

Dealing with High Cardinality 202

Scaling and Durability Strategies 202

Notes on Building Your Own Efficient Data Store 204

Conclusion 205

17 Cheap and Accurate Enough: Sampling 207

Sampling to Refine Your Data Collection 207

Using Different Approaches to Sampling 209

Constant-Probability Sampling 209

Sampling on Recent Traffic Volume 210

Sampling Based on Event Content (Keys) 210

Combining per Key and Historical Methods 211

Choosing Dynamic Sampling Options 211

When to Make a Sampling Decision for Traces 211

Translating Sampling Strategies into Code 212

The Base Case 212

Fixed-Rate Sampling 213

Recording the Sample Rate 213

Consistent Sampling 215

Target Rate Sampling 216

Having More Than One Static Sample Rate 218

Sampling by Key and Target Rate 218

Sampling with Dynamic Rates on Arbitrarily Many Keys 220

Putting It All Together: Head and Tail per Key Target Rate Sampling 222

Conclusion 223

18 Telemetry Management with Pipelines 225

Attributes of Telemetry Pipelines 226

Routing 226

Security and Compliance 227

Workload Isolation 227

Data Buffering 228

Capacity Management 228

Data Filtering and Augmentation 229

Data Transformation 230

Ensuring Data Quality and Consistency 230

Managing a Telemetry Pipeline: Anatomy 231

Challenges When Managing a Telemetry Pipeline 233

Performance 233

Correctness 233

Availability 233

Reliability 234

Isolation 234

Data Freshness 234

Use Case: Telemetry Management at Slack 235

Metrics Aggregation 235

Logs and Trace Events 236

Open Source Alternatives 238

Managing a Telemetry Pipeline: Build Versus Buy 239

Conclusion 240

Part V Spreading Observability Culture

19 The Business Case for Observability 243

The Reactive Approach to Introducing Change 243

The Return on Investment of Observability 245

The Proactive Approach to Introducing Change 246

Introducing Observability as a Practice 248

Using the Appropriate Tools 249

Instrumentation 250

Data Storage and Analytics 250

Rolling Out Tools to Your Teams 251

Knowing When You Have Enough Observability 252

Conclusion 253

20 Observability's Stakeholders and Allies 255

Recognizing Nonengineering Observability Needs 255

Creating Observability Allies in Practice 258

Customer Support Teams 258

Customer Success and Product Teams 259

Sales and Executive Teams 260

Using Observability Versus Business Intelligence Tools 261

Query Execution Time 262

Accuracy 262

Recency 262

Structure 263

Time Windows 263

Ephemerality 264

Using Observability and BI Tools Together in Practice 264

Conclusion 265

21 An Observability Maturity Model 267

A Note About Maturity Models 267

Why Observability Needs a Maturity Model 268

About the Observability Maturity Model 269

Capabilities Referenced in the OMM 270

Respond to System Failure with Resilience 271

Deliver High-Quality Code 273

Manage Complexity and Technical Debt 274

Release on a Predictable Cadence 275

Understand User Behavior 276

Using the OMM for Your Organization 277

Conclusion 277

22 Where to Go from Here 279

Observability, Then Versus Now 279

Additional Resources 281

Predictions for Where Observability Is Going 282

Index 287
