Why Programs Fail: A Guide to Systematic Debugging / Edition 2



This book is proof that debugging has graduated from a black art to a systematic discipline. It demystifies one of the toughest aspects of software programming, showing clearly how to discover what caused software failures, and fix them with minimal muss and fuss.

The fully updated second edition includes 100+ pages of new material, including new chapters on Verifying Code, Predicting Errors, and Preventing Errors. Cutting-edge tools such as FindBugs and AGITAR are explained, techniques from integrated environments like Jazz.net are highlighted, and all-new demos with ESC/Java and Spec#, Eclipse, and Mozilla are included.

This complete and pragmatic overview of debugging is authored by Andreas Zeller, the talented researcher who developed the GNU Data Display Debugger (DDD), a tool that over 250,000 professionals use to visualize the data structures of programs while they are running. Unlike other books on debugging, Zeller's text is product agnostic, appropriate for all programming languages and skill levels.

The book explains best practices ranging from systematically tracking error reports, to observing symptoms, reproducing errors, and correcting defects. It covers a wide range of tools and techniques from hands-on observation to fully automated diagnoses, and also explores the author's innovative techniques for isolating minimal input to reproduce an error and for tracking cause and effect through a program. It even includes instructions on how to create automated debugging tools.

The text includes exercises and extensive references for further study, and a companion website with source code for all examples and additional debugging resources is available.

• The new edition of this award-winning productivity-booster is for any developer who has ever been frustrated by elusive bugs

• Brand new chapters demonstrate cutting-edge debugging techniques and tools, enabling readers to put the latest time-saving developments to work for them

• Learn by doing. New exercises and detailed examples focus on emerging tools, languages, and environments, including AGITAR, FindBugs, Python, and Eclipse.

Why Programs Fail is about bugs in computer programs: how to find them, how to reproduce them, and how to fix them in such a way that they do not occur anymore. This is the first comprehensive book on systematic debugging. It covers a wide range of tools and techniques, from hands-on observation to fully automated diagnoses, and includes instructions for building automated debuggers. The discussion is built upon a solid theory of how failures occur, rather than relying on seat-of-the-pants techniques, which are of little help with large software systems or to those learning to program. The author, Andreas Zeller, is well known in the programming community for creating the GNU Data Display Debugger (DDD), a tool that visualizes the data structures of a program while it is running. The book is suitable for any programming language and all levels of programming experience.


Editorial Reviews

From the Publisher
Praise from the experts for the first edition:

"In this book, Andreas Zeller does an excellent job introducing useful debugging techniques and tools invented in both academia and industry. The book is easy to read and actually very fun as well. It will not only help you discover a new perspective on debugging, but it will also teach you some fundamental static and dynamic program analysis techniques in plain language."
-Miryung Kim, Software Developer, Motorola Korea

"Today every computer program written is also debugged, but debugging is not a widely studied or taught skill. Few books beyond this one present a systematic approach to finding and fixing programming errors."
-James Larus, Microsoft Research

"From the author of DDD, the famous data display debugger, now comes the definitive book on debugging. Zeller's book is chock-full with advice, insight, and tools to track down defects in programs, for all levels of experience and any programming language. The book is lucidly written, explaining the principles of every technique without boring the reader with minutiae. And best of all, at the end of each chapter it tells you where to download all those fancy tools. A great book for the software professional as well as the student interested in the frontiers of automated debugging."
-Walter F. Tichy, Professor, University of Karlsruhe, Germany

"Andreas Zeller's Why Programs Fail lays an excellent foundation for practitioners, educators, and researchers alike. Using a disciplined approach based on the scientific method, Zeller provides deep insights, detailed approaches, and illustrative examples."
-David Notkin, Professor, Computer Science & Engineering, University of Washington


Product Details

  • ISBN-13: 9780123745156
  • Publisher: Elsevier Science
  • Publication date: 6/26/2009
  • Edition description: New Edition
  • Edition number: 2
  • Pages: 424
  • Sales rank: 558,203
  • Product dimensions: 7.50 (w) x 9.10 (h) x 0.90 (d)

Meet the Author

Andreas Zeller is a full professor of Software Engineering at Saarland University in Saarbruecken, Germany. His research concerns the analysis of large software systems and their development process; his students are funded by companies such as Google, Microsoft, and SAP. In 2010, Zeller was inducted as a Fellow of the ACM for his contributions to automated debugging and mining software archives. In 2011, he received an ERC Advanced Grant, Europe's most prestigious individual research grant, for work on specification mining and test case generation. His book "Why Programs Fail", the "standard reference on debugging", received the 2006 Software Development Jolt Productivity Award.


Read an Excerpt

Why Programs Fail

A Guide to Systematic Debugging
By Andreas Zeller

Morgan Kaufmann

Copyright © 2009 Elsevier Inc.
All rights reserved.

ISBN: 978-0-08-092300-0

Chapter One

How Failures Come to Be

Your program fails. How can this be? The answer is that the programmer creates a defect in the code. When the code is executed, the defect causes an infection in the program state, which later becomes visible as a failure. To find the defect, one must reason backward, starting with the failure. This chapter defines the essential concepts when talking about debugging, and hints at the techniques discussed subsequently—hopefully whetting your appetite for the remainder of this book.


Oops! Your program fails. Now what? This is a common situation that interrupts our routine and requires immediate attention. Because the program mostly worked until now, we assume that something external has crept into our machine—something that is natural and unavoidable; something we are not responsible for—namely, a bug.

If you are a user, you have probably already learned to live with bugs. You may even think that bugs are unavoidable when it comes to software. As a programmer, though, you know that bugs do not creep out of mother nature into our programs. (See Bug Story 1 for an exception.) Rather, bugs are inherent parts of the programs we produce. At the beginning of any bug story stands a human who produces the program in question.

The following is a small program I once produced. The sample program is a very simple sorting tool. Given a list of numbers as command-line arguments, sample prints them as a sorted list on the standard output ($ is the command-line prompt).

$ ./sample 9 7 8
Output: 7 8 9
$ _

Unfortunately, sample does not always work properly, as demonstrated by the following failure:

$ ./sample 11 14
Output: 0 11
$ _

Although the sample output is sorted and contains the right number of items, some original arguments are missing and replaced by bogus numbers. Here, 14 is missing and replaced by 0. (Actual bogus numbers and behavior on your system may vary.) From the sample failure, we can deduce that sample has a bug (or, more precisely, a defect). This brings us to the key question of this chapter:


In general, a failure such as that in the sample program comes about in the four stages discussed in the following.

1. The programmer creates a defect. A defect is a piece of code that can cause an infection. Because the defect is part of the code, and because the code is initially written by a programmer, the defect is technically created by the programmer. If the programmer creates a defect, does that mean the programmer was at fault? Not necessarily. Consider the following:

* The original requirements did not foresee future changes. Think about the Y2K problem, for instance.

* A program behavior may become classified as a "failure" only when the user sees it for the first time.

* In a modular program, a failure may happen because of incompatible interfaces of two modules.

* In a distributed program, a failure may be the result of some unpredictable interaction of several components.

In such settings, deciding on who is to blame is a political, not a technical, question. Nobody made a mistake, and yet a failure occurred. (See Bug Story 2 for more on such failures.)

2. The defect causes an infection. The program is executed, and with it the defect. The defect now creates an infection—that is, after execution of the defect, the program state differs from what the programmer intended.

A defect in the code does not necessarily cause an infection. The defective code must be executed, and it must be executed under such conditions that the infection actually occurs.

3. The infection propagates. Most functions result in errors when fed with erroneous input. As the remaining program execution accesses the state, it generates further infections that can spread into later program states. An infection need not, however, propagate continuously. It may be overwritten, masked, or corrected by some later program action.

4. The infection causes a failure. A failure is an externally observable error in the program behavior. It is caused by an infection in the program state.
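Stages 2 through 4 can be seen in a tiny, purely hypothetical example (the function max_of and its inputs are illustrative, not from the book):

```c
/* Hypothetical example: a defect that does not always infect the state.
   The intent is to return the larger of a and b. */
int max_of(int a, int b) {
    if (a > b)
        return a;
    else
        return a;   /* defect: should be "return b" */
}
```

The defective line is executed for every call with a <= b, yet when a equals b it still yields the intended value: the defect is executed without causing an infection. Only when a < b does an infection arise, and since the infected value is the return value, it immediately becomes an observable failure.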

The program execution process is sketched in Figure 1.1. Each program state consists of the values of the program variables, as well as the current execution position (formally, the program counter). Each state determines subsequent states, up to the final state (at the bottom in the figure), in which we can observe the failure (indicated by the x).

Not every defect results in an infection, and not every infection results in a failure. Thus, having no failures does not imply having no defects. This is the curse of testing, as pointed out by Dijkstra. Testing can only show the presence of defects, but never their absence.

In the case of sample, though, we have actually experienced a failure. In hindsight, every failure is thus caused by some infection, and every infection is caused by some earlier infection, originating at the defect. This cause–effect chain from defect to failure is called an infection chain.

The issue of debugging is thus to identify the infection chain, to find its root cause (the defect), and to remove the defect such that the failure no longer occurs. This is what we shall do with the sample program.


In general, debugging of a program such as sample can be decomposed into seven steps (List 1.1), of which the initial letters form the word TRAFFIC.

1. Track the problem in the database.

2. Reproduce the failure.

3. Automate and simplify the test case.

4. Find possible infection origins.

5. Focus on the most likely origins.

6. Isolate the infection chain.

7. Correct the defect.

Of these steps, tracking the problem in a problem database is mere bookkeeping (see also Chapter 2) and reproducing the problem is not that difficult for deterministic programs such as sample. It can be difficult for nondeterministic programs and long-running programs, though, which is why Chapter 4 discusses the issues involved in reproducing failures.

Automating the test case is also rather straightforward, and results in automatic simplification (see also Chapter 5). The last step, correcting the defect, is usually simple once you have understood how the defect causes the failure (see Chapter 15).

The final three steps—from finding the infection origins to isolating the infection chain—are the steps concerned with understanding how the failure came to be. This task requires by far the most time, as well as other resources. Understanding how the failure came to be is what the rest of this section, and the other chapters of this book, are about.

Why is understanding the failure so difficult? Considering Figure 1.1, all one need do to find the defect is isolate the transition from a sane state (i.e., noninfected, as intended) to an infected state. This is a search in space (as we have to find out which part of the state is infected) as well as in time (as we have to find out when the infection takes place).

However, examination of space and time are enormous tasks for even the simplest programs. Each state consists of dozens, thousands, or even millions of variables. For example, Figure 1.2 shows a visualization of the program state of the GNU compiler (GCC) while compiling a program. The program state consists of about 44,000 individual variables, each with a distinct value, and about 42,000 references between variables. (Chapter 14 discusses how to obtain and use such program states in debugging.)

Not only is a single state quite large, a program execution consists of thousands, millions, or even billions of such states. Space and time thus form a wide area in which only two points are well known (Figure 1.3): initially, the entire state is sane, and eventually some part of the state is infected (x). Within the area spanned by space and time, the aim of debugging is to locate the defect—a single transition from sane to infected that eventually causes the failure (Figure 1.4).

Thinking about the dimensions of space and time, this may seem like searching for a needle in an endless row of haystacks—and indeed, the fact is that debugging is largely a search problem. This search is driven by the following two major principles:

1. Separate sane from infected. If a state is infected, it may be part of the infection propagating from defect to failure. If a state is sane, there is no infection to propagate.

2. Separate relevant from irrelevant. A variable value is the result of a limited number of earlier variable values. Thus, only some part of the earlier state may be relevant to the failure.

Figure 1.5 illustrates this latter technique. The failure, to reiterate, can only have been caused by a small number of other variables in earlier states (denoted using the exclamation point, !), the values of which in turn can only have come from other earlier variables. One says that subsequent variable values depend on earlier values. This results in a series of dependences from the failure back to earlier variable values. To locate the defect, it suffices to examine these values only—as other values could not have possibly caused the failure—and separate these values into sane and infected. If we find an infected value, we must find and fix the defect that causes it. Typically, this is the same defect that causes the original failure.
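As a hypothetical sketch of such a dependence chain (the function and values below are illustrative only): the returned value depends only on c, and c only on a, so b and everything computed from it can be ignored when diagnosing a wrong result.

```c
/* Hypothetical sketch of following dependences backward.
   Dependence chain of the result: result <- c <- a.
   b lies on no dependence path to the result and is irrelevant. */
int compute(int a, int b) {
    int unused = b * 2;   /* not on any dependence path to the result */
    int c = a - 1;        /* defect: suppose the intent was a + 1 */
    (void)unused;
    return c;
}
```

If compute(11, 14) returns a wrong value, only the assignments to c and a need to be examined; the code involving b could not have caused it.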

Why is it that a variable value can be caused only by a small number of earlier variables? Good programming style dictates division of the state into units such that the information flow between these units is minimized. Typically, your programming language provides a means of structuring the state, just as it helps you to structure the program code. However, whether you divide the state into functions, modules, objects, packages, or components, the principle is the same: a divided state is much easier to conquer.


Let's put our knowledge about states and dependences into practice, following the TRAFFIC steps (List 1.1).

1.4.1 Track the Problem

The first step in debugging is to track the problem—that is, to file a problem report such that the defect will not go by unnoticed. In our case, we have already observed the failure symptom: the output of sample, when invoked with arguments 11 and 14, contains a zero.

$ ./sample 11 14
Output: 0 11
$ _

An actual problem report would include this invocation as an instruction on how to reproduce the problem (see Chapter 2).

1.4.2 Reproduce the Failure

In case of the sample program, reproducing the failure is easy. All you need do is reinvoke sample, as shown previously. In other cases, though, reproducing may require control over all possible input sources (techniques are described in Chapter 4).

1.4.3 Automate and Simplify the Test Case

If sample were a more complex program, we would have to think about how to automate the failure (in that we want to reproduce the failure automatically) and how to simplify its input such that we obtain a minimal test case. In the case of sample, though, this is not necessary (for more complex programs, Chapter 5 covers the details).

1.4.4 Find Possible Infection Origins

Where does the zero in the output come from? This is the fourth step in the TRAFFIC steps: We must find possible infection origins. To find possible origins, we need the actual C source code of sample, shown in Example 1.1. We quickly see that the program consists of two functions: shell_sort() (which implements the shell sort algorithm) and main, which realizes a simple test driver around shell_sort(). The main function:

* Allocates an array a (line 32).

* Copies the command-line arguments into a (lines 33–34).

* Sorts a by invoking shell_sort() (line 36).

* Prints the content of a (lines 38–41).

By matching the output to the appropriate code, we find that the 0 printed by sample is the value of the variable a[0], the first element of the array a. This is the infection we observe: at line 39 in sample.c, variable a[0] is obviously zero.

Where does the zero in a[0] come from? Working our way backward from line 40, we find in line 36 the call shell_sort(a, argc), where the array a is passed by reference. This function might well be the point at which a[0] was assigned the infected value.
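The listing of Example 1.1 is not reproduced in this excerpt. The following is a hypothetical reconstruction consistent with the description above; the shell sort implementation itself is an assumption, but the defective call shell_sort(a, argc) is quoted from the text.

```c
#include <stdlib.h>

/* Hypothetical reconstruction of sample's shell_sort() */
void shell_sort(int a[], int size) {
    int h = 1;
    do {
        h = h * 3 + 1;
    } while (h <= size);
    do {
        h /= 3;
        for (int i = h; i < size; i++) {
            int v = a[i];
            int j;
            for (j = i; j >= h && a[j - h] > v; j -= h)
                a[j] = a[j - h];
            if (i != j)
                a[j] = v;
        }
    } while (h != 1);
}

/* Driver as described in the text: argc counts the program name as
   well, so only argc - 1 numbers are copied into a. The call
   shell_sort(a, argc) thus sorts one element too many, pulling
   whatever lies beyond the array (often a zero) into the output. */
int *sort_args(int argc, char *argv[]) {
    int *a = malloc((argc - 1) * sizeof(int));
    for (int i = 0; i < argc - 1; i++)
        a[i] = atoi(argv[i + 1]);
    shell_sort(a, argc);   /* defect: should be argc - 1 */
    return a;
}
```

With the correct size, shell_sort itself sorts properly; it is the off-by-one in the call that lets the sort read past the end of a and assign a[0] an infected value.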


Excerpted from Why Programs Fail by Andreas Zeller Copyright © 2009 by Elsevier Inc. . Excerpted by permission of Morgan Kaufmann. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.


Table of Contents

Amended Table of Contents
• denotes additions/changes for the proposed second edition. For brevity, second-level sections are omitted from the list. Please note that there are also recurring end-of-chapter sections: Concepts, Tools, Further Reading, and Exercises.

Table of Contents
• Include a list of "How To's" as indicated in appropriate chapters
About the Author
• What's new in the second edition

1 How Failures Come to Be
1.1 My Program Does Not Work!
• New section "Facts on Bugs" - highlighting recent empirical findings
1.2 From Defects to Failures
1.3 Lost in Time and Space
1.4 From Failures to Fixes
1.5 Automated Debugging Techniques
1.6 Bugs, Faults, or Defects?
• New section "Learning From Mistakes" - pointing to the later chapter

2 Tracking Problems
2.1 Oh! All These Problems
2.2 Reporting Problems
2.3 Managing Problems
2.4 Classifying Problems
2.5 Processing Problems
2.6 Managing Problem Tracking
2.7 Requirements as Problems
2.8 Managing Duplicates
• New section "Collecting Problem Data" - laying the foundation for later investigation
2.9 Relating Problems and Fixes
2.10 Relating Problems and Tests
• 2.9 and 2.10 will be merged into a new section "A Concert of Activities", focusing on integrated environments like Jazz.net

3 Making Programs Fail
3.1 Testing for Debugging
3.2 Controlling the Program
3.3 Testing at the Presentation Layer
3.4 Testing at the Functionality Layer
3.5 Testing at the Unit Layer
3.6 Isolating Units
3.7 Designing for Debugging
• Expand on "design for diagnosability", esp. for embedded systems
3.8 Preventing Unknown Problems
• This section will be deleted and replaced with a whole new chapter 18

4 Reproducing Problems
4.1 The First Task in Debugging
4.2 Reproducing the Problem Environment
4.3 Reproducing Program Execution
4.4 Reproducing System Interaction
4.5 Focusing on Units
• Expand reflecting latest research results

5 Simplifying Problems
5.1 Simplifying the Problem
5.2 The Gecko BugAThon
5.3 Manual Simplification
5.4 Automatic Simplification
5.5 A Simplification Algorithm
5.6 Simplifying User Interaction
5.7 Random Input Simplified
5.8 Simplifying Faster

6 Scientific Debugging
6.1 How to Become a Debugging Guru
6.2 The Scientific Method
6.3 Applying the Scientific Method
6.4 Explicit Debugging
6.5 Keeping a Logbook
6.6 Debugging Quick-and-Dirty
6.7 Algorithmic Debugging
6.8 Deriving a Hypothesis
6.9 Reasoning About Programs

7 Deducing Errors
• This chapter will be renamed to "Tracking Dependences"
7.1 Isolating Value Origins
7.2 Understanding Control Flow
7.3 Tracking Dependences
7.4 Slicing Programs
7.5 Deducing Code Smells
• Move to new chapter 11 "Verifying Code"
7.6 Limits of Static Analysis
• Move to new chapter 11 "Verifying Code"

8 Observing Facts
8.1 Observing State
8.2 Logging Execution
8.3 Using Debuggers
8.4 Querying Events
8.5 Visualizing State

9 Tracking Origins
9.1 Reasoning Backwards
• Update with recent commercial tools
9.2 Exploring Execution History
9.3 Dynamic Slicing
9.4 Leveraging Origins
• Expand to use latest tools by Ko et al. as well as Gupta et al.
9.5 Tracking Down Infections

10 Asserting Expectations
10.1 Automating Observation
10.2 Basic Assertions
• Explain "design by contract" and its principles
10.3 Asserting Invariants
• Expand on integrating contracts with inheritance
10.4 Asserting Correctness
10.5 Assertions as Specifications
10.6 From Assertions to Verification
• Move to its own chapter "Verifying Code"
10.7 Reference Runs
• Move to "Verifying Code"
10.8 System Assertions
10.9 Checking Production Code
• Expand discussion; consider checking preconditions only

• New Chapter 11 Verifying Code - Why does my code smell?
• Highlight tools like FindBugs
• Defects as Abnormal Behavior
• Discuss work by Engler et al.
Assertions as Specifications
From Assertions to Verification - moved from 10.6
• Show the integration of ESC/Java and Spec# (with demos)
Reference Runs - moved from 10.7
• Limits of Static Analysis

12 Detecting Anomalies
12.1 Capturing Normal Behavior
12.2 Comparing Coverage
12.3 Statistical Debugging
• Include and reflect recent work
• Integrate machine learning approaches
• Refer to the iBugs library
12.4 Collecting Data in the Field
12.5 Dynamic Invariants
• Discuss the AGITAR tool
12.6 Invariants on the Fly
12.7 From Anomalies to Defects

13 Causes and Effects
13.1 Causes and Alternate Worlds
13.2 Verifying Causes
13.3 Causality in Practice
13.4 Finding Actual Causes
13.5 Narrowing Down Causes
13.6 A Narrowing Example
13.7 The Common Context
13.8 Causes in Debugging

14 Isolating Failure Causes
14.1 Isolating Causes Automatically
14.2 Isolating versus Simplifying
14.3 An Isolation Algorithm
14.4 Implementing Isolation
14.5 Isolating Failure-inducing Input
14.6 Isolating Failure-inducing Schedules
14.7 Isolating Failure-inducing Changes
• Update to recent tools and screenshots
14.8 Problems and Limitations

15 Isolating Cause-Effect Chains
15.1 Useless Causes
15.2 Capturing Program States
15.3 Comparing Program States
15.4 Isolating Relevant Program States
15.5 Isolating Cause-Effect Chains
15.6 Isolating Failure-inducing Code
15.7 Issues and Risks
• Discuss how to recreate state via method calls
• New project in Python

16 Fixing the Defect
16.1 Locating the Defect
16.2 Focusing on the Most Likely Errors
16.3 Validating the Defect
16.4 Correcting the Defect
16.5 Workarounds
16.6 Learning from Mistakes
• This becomes its own chapter 17

• New chapter 17 Learning from Mistakes
• 17.1 Measuring effort and damage - we want to know how much effort and cost went into each problem
• 17.2 Leveraging software archives - collect data from problem and change databases; access more of them
• 17.3 Mapping errors - which components have had the most errors in the past? Demonstrated using Eclipse and Mozilla data
• 17.4 Predicting errors - which components will have the most errors in the future?
• 17.5 What is it that makes software complex? - complexity of code, lack of quality assurance, changing requirements... and how to measure this
• 17.6 Digging for more data - Goal-Question-Metric approach; experience factory
• 17.7 Continuous Improvement - Space Shuttle software

• New chapter 18 Preventing Errors
18.1 Keep Things Simple - general principles of good design and coding
18.2 Know What to Do - pragmatic specification (design by contract, assertions)
18.3 Know How to Check - general principles of quality assurance
18.4 Learn from Mistakes - as laid out in the new Chapter 17; integrated with earlier principles
18.5 Improve Process and Product - keep on challenging yourself

Appendix: Formal Definitions
A.1 Delta Debugging
A.2 Memory Graphs
A.3 Cause-Effect Chains


