Uh-oh, it looks like your Internet Explorer is out of date.

For a better shopping experience, please upgrade now.

Why Programs Fail: A Guide to Systematic Debugging

Why Programs Fail: A Guide to Systematic Debugging

by Andreas Zeller

See All Formats & Editions

This book is proof that debugging has graduated from a black art to a systematic discipline. It demystifies one of the toughest aspects of software programming, showing clearly how to discover what caused software failures, and fix them with minimal muss and fuss.

The fully updated second edition includes 100+ pages of new material, including new chapters on


This book is proof that debugging has graduated from a black art to a systematic discipline. It demystifies one of the toughest aspects of software programming, showing clearly how to discover what caused software failures, and fix them with minimal muss and fuss.

The fully updated second edition includes 100+ pages of new material, including new chapters on Verifying Code, Predicting Erors, and Preventing Errors. Cutting-edge tools such as FindBUGS and AGITAR are explained, techniques from integrated environments like Jazz.net are highlighted, and all-new demos with ESC/Java and Spec#, Eclipse and Mozilla are included.

This complete and pragmatic overview of debugging is authored by Andreas Zeller, the talented researcher who developed the GNU Data Display Debugger(DDD), a tool that over 250,000 professionals use to visualize the data structures of programs while they are running. Unlike other books on debugging, Zeller's text is product agnostic, appropriate for all programming languages and skill levels.

The book explains best practices ranging from systematically tracking error reports, to observing symptoms, reproducing errors, and correcting defects. It covers a wide range of tools and techniques from hands-on observation to fully automated diagnoses, and also explores the author's innovative techniques for isolating minimal input to reproduce an error and for tracking cause and effect through a program. It even includes instructions on how to create automated debugging tools.

The text includes exercises and extensive references for further study, and a companion website with source code for all examples and additional debugging resources is available.

*The new edition of this award-winning productivity-booster is for any developer who has ever been frustrated by elusive bugs

*Brand new chapters demonstrate cutting-edge debugging techniques and tools, enabling readers to put the latest time-saving developments to work for them

*Learn by doing. New exercises and detailed examples focus on emerging tools, languages and environments, including AGITAR, FindBUGS, Python and Eclipse.

Editorial Reviews

From the Publisher
Praise from the experts for the first edition:

"In this book, Andreas Zeller does an excellent job introducing useful debugging techniques and tools invented in both academia and industry. The book is easy to read and actually very fun as well. It will not only help you discover a new perspective on debugging, but it will also teach you some fundamental static and dynamic program analysis techniques in plain language."
-Miryung Kim, Software Developer, Motorola Korea

"Today every computer program written is also debugged, but debugging is not a widely studied or taught skill. Few books beyond this one present a systematic approach to finding and fixing programming errors."
-James Larus, Microsoft Research

"From the author of ODD, the famous data display debugger, now comes the definitive book on debugging. Zeller's book is chock-full with advice, insight, and tools to track down defects in programs, for all levels of experience and any programming language. The book is lucidly written, explaining the principles of every technique without boring the reader with minutiae. And best of all, at the end of each chapter it tells you where to download all those fancy tools. A great book for the software professional as well as the student interested in the frontiers of automated debugging."
-Walter F. Tichy, Professor, University Karlsruhe, Germany

"Andreas Zeller's Why Programs Fail lays an excellent foundation far practitioners, educators, and researchers alike. Using a disciplined approach based on the scientific method, Zeller provides deep insights, detailed approaches, and illustrative examples."
-David Notkin, Professor Computer Science & Engineering, University of Washington

Product Details

Elsevier Science
Publication date:
Sold by:
Barnes & Noble
File size:
12 MB
This product may take a few minutes to download.

Read an Excerpt

Why Programs Fail

A Guide to Systematic Debugging
By Andreas Zeller

Morgan Kaufmann

Copyright © 2009 Elsevier Inc.
All right reserved.

ISBN: 978-0-08-092300-0

Chapter One

How Failures Come to Be

Your program fails. How can this be? The answer is that the programmer creates a defect in the code. When the code is executed, the defect causes an infection in the program state, which later becomes visible as a failure. To find the defect, one must reason backward, starting with the failure. This chapter defines the essential concepts when talking about debugging, and hints at the techniques discussed subsequently—hopefully whetting your appetite for the remainder of this book.


Oops! Your program fails. Now what? This is a common situation that interrupts our routine and requires immediate attention. Because the program mostly worked until now, we assume that something external has crept into our machine—something that is natural and unavoidable;something we are not responsible for—namely,a bug.

If you are a user, you have probably already learned to live with bugs. You may even think that bugs are unavoidable when it comes to software. As a programmer, though, you know that bugs do not creep out of mother nature into our programs. (See Bug Story 1 for an exception.) Rather, bugs are inherent parts of the programs we produce. At the beginning of any bug story stands a human who produces the program in question.

The following is a small program I once produced. The sample program is a very simple sorting tool. Given a list of numbers as command-line arguments, sample prints them as a sorted list on the standard output ($ is the command-line prompt).

$ ./sample 9 7 8 Output: 7 8 9 $ _

Unfortunately, sample does not always work properly, as demonstrated by the following failure:

$ ./sample 11 14 Output: 0 11 $ _

Although the sample output is sorted and contains the right number of items, some original arguments are missing and replaced by bogus numbers. Here, 14 is missing and replaced by 0. (Actual bogus numbers and behavior on your system may vary.) From the sample failure, we can deduce that sample has a bug (or, more precisely, a defect). This brings us to the key question of this chapter:


In general, a failure such as that in the sample program comes about in the four stages discussed in the following.

1. The programmer creates a defect. A defect is a piece of the code that can cause an infection. Because the defect is part of the code, and because every code is initially written by a programmer, the defect is technically created by the programmer. If the programmer creates a defect, does that mean the programmer was at fault? Not necessarily. Consider the following:

* The original requirements did not foresee future changes. Think about the Y2K problem, for instance.

* A program behavior may become classified as a "failure" only when the user sees it for the first time.

* In a modular program, a failure may happen because of incompatible interfaces of two modules.

* In a distributed program, a failure may be the result of some unpredictable interaction of several components.

In such settings, deciding on who is to blame is a political, not a technical, question. Nobody made a mistake, and yet a failure occurred. (See Bug Story 2 for more on such failures.)

2. The defect causes an infection. The program is executed, and with it the defect. The defect now creates an infection—that is, after execution of the defect, the program state differs from what the programmer intended.

A defect in the code does not necessarily cause an infection. The defective code must be executed, and it must be executed under such conditions that the infection actually occurs.

3. The infection propagates. Most functions result in errors when fed with erroneous input. As the remaining program execution accesses the state, it generates further infections that can spread into later program states. An infection need not, however, propagate continuously. It may be overwritten, masked, or corrected by some later program action.

4. The infection causes a failure. A failure is an externally observable error in the program behavior. It is caused by an infection in the program state.

The program execution process is sketched in Figure 1.1. Each program state consists of the values of the program variables, as well as the current execution position (formally, the program counter). Each state determines subsequent states, up to the final state (at the bottom in the figure),in which we can observe the failure (indicated by the x).

Not every defect results in an infection, and not every infection results in a failure. Thus, having no failures does not imply having no defects. This is the curse of testing, as pointed out by Dijkstra. Testing can only show the presence of defects, but never their absence.

In the case of sample, though, we have actually experienced a failure. In hindsight, every failure is thus caused by some infection, and every infection is caused by some earlier infection, originating at the defect. This cause–effect chain from defect to failure is called an infection chain.

The issue of debugging is thus to identify the infection chain, to find its root cause (the defect),and to remove the defect such that the failure no longer occurs. This is what we shall do with the sample program.


In general, debugging of a program such as sample can be decomposed into seven steps (List 1.1), of which the initial letters form the word TRAFFIC.

1. Track the problem in the database.

2. Reproduce the failure.

3. Automate and simplify the test case.

4. Find possible infection origins.

5. Focus on the most likely origins.

6. Isolate the infection chain.

7. Correct the defect.

Of these steps, tracking the problem in a problem database is mere bookkeeping (see also Chapter 2) and reproducing the problem is not that difficult for deterministic programs such as sample. It can be difficult for nondeterministic programs and long-running programs, though, which is why Chapter 4 discusses the issues involved in reproducing failures.

Automating the test case is also rather straightforward, and results in automatic simplification (see also Chapter 5). The last step, correcting the defect, is usually simple once you have understood how the defect causes the failure (see Chapter 15).

The final three steps—from finding the infection origins to isolating the infection chain—are the steps concerned with understanding how the failure came to be. This task requires by far the most time, as well as other resources. Understanding how the failure came to be is what the rest of this section, and the other chapters of this book, are about.

Why is understanding the failure so difficult? Considering Figure 1.1,all one need do to find the defect is isolate the transition from a sane state (i.e., noninfected, as intended) to an infected state. This is a search in space (as we have to find out which part of the state is infected) as well as in time (as we have to find out when the infection takes place).

However, examination of space and time are enormous tasks for even the simplest programs. Each state consists of dozens, thousands, or even millions of variables. For example, Figure 1.2 shows a visualization of the program state of the GNU compiler (GCC) while compiling a program. The program state consists of about 44,000 individual variables, each with a distinct value, and about 42,000 references between variables. (Chapter 14 discusses how to obtain and use such program states in debugging.)

Not only is a single state quite large, a program execution consists of thousands, millions, or even billions of such states. Space and time thus form a wide area in which only two points are well known (Figure 1.3): initially, the entire state is sane (

  •  ), and eventually some part of the state is infected (x). Within the area spanned by space and time, the aim of debugging is to locate the defect—a single transition from sane (
  •  ) to infected (
  •  ) that eventually causes the failure (Figure 1.4).

    Thinking about the dimensions of space and time, this may seem like searching for a needle in an endless row of haystacks—and indeed, the fact is that debugging is largely a search problem. This search is driven by the following two major principles:

    1. Separate sane from infected. If a state is infected, it may be part of the infection propagating from defect to failure. If a state is sane, there is no infection to propagate.

    2. Separate relevant from irrelevant. A variable value is the result of a limited number of earlier variable values. Thus, only some part of the earlier state may be relevant to the failure.

    Figure 1.5 illustrates this latter technique. The failure, to reiterate, can only have been caused by a small number of other variables in earlier states (denoted using the exclamation point, !), the values of which in turn can only have come from other earlier variables. One says that subsequent, variable values depend on earlier values. This results in a series of dependences from the failure back to earlier variable values. To locate the defect, it suffices to examine these values only—as other values could not have possibly caused the failure—and separate these values into sane and infected. If we find an infected value, we must find and fix the defect that causes it. Typically, this is the same defect that causes the original failure.

    Why is it that a variable value can be caused only by a small number of earlier variables? Good programming style dictates division of the state into units such that the information flow between these units is minimized. Typically, your programming language provides a means of structuring the state, just as it helps you to structure the program code. However, whether you divide the state into functions, modules, objects, packages, or components, the principle is the same: a divided state is much easier to conquer.


    Let's put our knowledge about states and dependences into practice, following the TRAFFIC steps (List 1.1).

    1.4.1 Track the Problem

    The first step in debugging is to track the problem—that is, to file a problem report such that the defect will not go by unnoticed. In our case, we have already observed the failure symptom: the output of sample, when invoked with arguments 11 and 14,contains a zero.

    $ ./sample 11 14 Output: 0 11 $ _

    An actual problem report would include this invocation as an instruction on how to reproduce the problem (see Chapter 2).

    1.4.2 Reproduce the Failure

    In case of the sample program, reproducing the failure is easy. All you need do is reinvoke sample, as shown previously. In other cases, though, reproducing may require control over all possible input sources (techniques are described in Chapter 4).

    1.4.3 Automate and Simplify the Test Case

    If sample were a more complex program, we would have to think about how to automate the failure (in that we want to reproduce the failure automatically) and how to simplify its input such that we obtain a minimal test case. In the case of sample, though, this is not necessary (for more complex programs, Chapter 5 covers the details).

    1.4.4 Find Possible Infection Origins

    Where does the zero in the output come from? This is the fourth step in the TRAFFIC steps: We must find possible infection origins. To find possible origins, we need the actual C source code of sample, shown in Example 1.1. We quickly see that the program consists of two functions:shell_sort() (which implements the shell sort algorithm) and main, which realizes a simple test driver around shell_sort(). The main function:

    * Allocates an array a (line 32).

    * Copies the command-line arguments into a (lines 33–34).

    * Sorts a by invoking shell_sort() (line 36).

    * Prints the content of a (lines 38–41).

    By matching the output to the appropriate code, we find that the 0 printed by sample is the value of the variable a[0], the first element of the array a. This is the infection we observe: at line 39 in sample.c, variable a[0] is obviously zero.

    Where does the zero in a[0] come from? Working our way backward from line 40, we find in line 36 the call shell_sort(a, argc), where the array a is passed by reference. This function might well be the point at which a[0] was assigned the infected value.


    Excerpted from Why Programs Fail by Andreas Zeller Copyright © 2009 by Elsevier Inc. . Excerpted by permission of Morgan Kaufmann. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
    Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.

  • Meet the Author

    Andreas Zeller is a full professor for Software Engineering at Saarland University in Saarbruecken, Germany. His research concerns the analysis of large software systems and their development process; his students are funded by companies like Google, Microsoft, or SAP. In 2010, Zeller was inducted as Fellow of the ACM for his contributions to automated debugging and mining software archives. In 2011, he received an ERC Advanced Grant, Europe's highest and most prestigious individual research grant, for work on specification mining and test case generation. His book "Why programs fail", the "standard reference on debugging", obtained the 2006 Software Development Jolt Productivity Award.

    Customer Reviews

    Average Review:

    Post to your social network


    Most Helpful Customer Reviews

    See all customer reviews