Why Programs Fail
A Guide to Systematic Debugging
By Andreas Zeller
Copyright © 2009 Elsevier Inc.
All right reserved. ISBN: 978-0-08-092300-0
How Failures Come to Be
Your program fails. How can this be? The answer is that the programmer creates a defect in the code. When the code is executed, the defect causes an infection in the program state, which later becomes visible as a failure. To find the defect, one must reason backward, starting with the failure. This chapter defines the essential concepts when talking about debugging, and hints at the techniques discussed subsequently—hopefully whetting your appetite for the remainder of this book.
1.1 MY PROGRAM DOES NOT WORK!
Oops! Your program fails. Now what? This is a common situation that interrupts our routine and requires immediate attention. Because the program mostly worked until now, we assume that something external has crept into our machine—something that is natural and unavoidable;something we are not responsible for—namely,a bug.
If you are a user, you have probably already learned to live with bugs. You may even think that bugs are unavoidable when it comes to software. As a programmer, though, you know that bugs do not creep out of mother nature into our programs. (See Bug Story 1 for an exception.) Rather, bugs are inherent parts of the programs we produce. At the beginning of any bug story stands a human who produces the program in question.
The following is a small program I once produced. The sample program is a very simple sorting tool. Given a list of numbers as command-line arguments, sample prints them as a sorted list on the standard output ($ is the command-line prompt).
$ ./sample 9 7 8 Output: 7 8 9 $ _
Unfortunately, sample does not always work properly, as demonstrated by the following failure:
$ ./sample 11 14 Output: 0 11 $ _
Although the sample output is sorted and contains the right number of items, some original arguments are missing and replaced by bogus numbers. Here, 14 is missing and replaced by 0. (Actual bogus numbers and behavior on your system may vary.) From the sample failure, we can deduce that sample has a bug (or, more precisely, a defect). This brings us to the key question of this chapter:
1.2 FROM DEFECTS TO FAILURES
In general, a failure such as that in the sample program comes about in the four stages discussed in the following.
1. The programmer creates a defect. A defect is a piece of the code that can cause an infection. Because the defect is part of the code, and because every code is initially written by a programmer, the defect is technically created by the programmer. If the programmer creates a defect, does that mean the programmer was at fault? Not necessarily. Consider the following:
* The original requirements did not foresee future changes. Think about the Y2K problem, for instance.
* A program behavior may become classified as a "failure" only when the user sees it for the first time.
* In a modular program, a failure may happen because of incompatible interfaces of two modules.
* In a distributed program, a failure may be the result of some unpredictable interaction of several components.
In such settings, deciding on who is to blame is a political, not a technical, question. Nobody made a mistake, and yet a failure occurred. (See Bug Story 2 for more on such failures.)
2. The defect causes an infection. The program is executed, and with it the defect. The defect now creates an infection—that is, after execution of the defect, the program state differs from what the programmer intended.
A defect in the code does not necessarily cause an infection. The defective code must be executed, and it must be executed under such conditions that the infection actually occurs.
3. The infection propagates. Most functions result in errors when fed with erroneous input. As the remaining program execution accesses the state, it generates further infections that can spread into later program states. An infection need not, however, propagate continuously. It may be overwritten, masked, or corrected by some later program action.
4. The infection causes a failure. A failure is an externally observable error in the program behavior. It is caused by an infection in the program state.
The program execution process is sketched in Figure 1.1. Each program state consists of the values of the program variables, as well as the current execution position (formally, the program counter). Each state determines subsequent states, up to the final state (at the bottom in the figure),in which we can observe the failure (indicated by the x).
Not every defect results in an infection, and not every infection results in a failure. Thus, having no failures does not imply having no defects. This is the curse of testing, as pointed out by Dijkstra. Testing can only show the presence of defects, but never their absence.
In the case of sample, though, we have actually experienced a failure. In hindsight, every failure is thus caused by some infection, and every infection is caused by some earlier infection, originating at the defect. This cause–effect chain from defect to failure is called an infection chain.
The issue of debugging is thus to identify the infection chain, to find its root cause (the defect),and to remove the defect such that the failure no longer occurs. This is what we shall do with the sample program.
1.3 LOST IN TIME AND SPACE
In general, debugging of a program such as sample can be decomposed into seven steps (List 1.1), of which the initial letters form the word TRAFFIC.
1. Track the problem in the database.
2. Reproduce the failure.
3. Automate and simplify the test case.
4. Find possible infection origins.
5. Focus on the most likely origins.
6. Isolate the infection chain.
7. Correct the defect.
Of these steps, tracking the problem in a problem database is mere bookkeeping (see also Chapter 2) and reproducing the problem is not that difficult for deterministic programs such as sample. It can be difficult for nondeterministic programs and long-running programs, though, which is why Chapter 4 discusses the issues involved in reproducing failures.
Automating the test case is also rather straightforward, and results in automatic simplification (see also Chapter 5). The last step, correcting the defect, is usually simple once you have understood how the defect causes the failure (see Chapter 15).
The final three steps—from finding the infection origins to isolating the infection chain—are the steps concerned with understanding how the failure came to be. This task requires by far the most time, as well as other resources. Understanding how the failure came to be is what the rest of this section, and the other chapters of this book, are about.
Why is understanding the failure so difficult? Considering Figure 1.1,all one need do to find the defect is isolate the transition from a sane state (i.e., noninfected, as intended) to an infected state. This is a search in space (as we have to find out which part of the state is infected) as well as in time (as we have to find out when the infection takes place).
However, examination of space and time are enormous tasks for even the simplest programs. Each state consists of dozens, thousands, or even millions of variables. For example, Figure 1.2 shows a visualization of the program state of the GNU compiler (GCC) while compiling a program. The program state consists of about 44,000 individual variables, each with a distinct value, and about 42,000 references between variables. (Chapter 14 discusses how to obtain and use such program states in debugging.)
Not only is a single state quite large, a program execution consists of thousands, millions, or even billions of such states. Space and time thus form a wide area in which only two points are well known (Figure 1.3): initially, the entire state is sane ( ), and eventually some part of the state is infected (x). Within the area spanned by space and time, the aim of debugging is to locate the defect—a single transition from sane ( ) to infected ( ) that eventually causes the failure (Figure 1.4).
Thinking about the dimensions of space and time, this may seem like searching for a needle in an endless row of haystacks—and indeed, the fact is that debugging is largely a search problem. This search is driven by the following two major principles:
1. Separate sane from infected. If a state is infected, it may be part of the infection propagating from defect to failure. If a state is sane, there is no infection to propagate.
2. Separate relevant from irrelevant. A variable value is the result of a limited number of earlier variable values. Thus, only some part of the earlier state may be relevant to the failure.
Figure 1.5 illustrates this latter technique. The failure, to reiterate, can only have been caused by a small number of other variables in earlier states (denoted using the exclamation point, !), the values of which in turn can only have come from other earlier variables. One says that subsequent, variable values depend on earlier values. This results in a series of dependences from the failure back to earlier variable values. To locate the defect, it suffices to examine these values only—as other values could not have possibly caused the failure—and separate these values into sane and infected. If we find an infected value, we must find and fix the defect that causes it. Typically, this is the same defect that causes the original failure.
Why is it that a variable value can be caused only by a small number of earlier variables? Good programming style dictates division of the state into units such that the information flow between these units is minimized. Typically, your programming language provides a means of structuring the state, just as it helps you to structure the program code. However, whether you divide the state into functions, modules, objects, packages, or components, the principle is the same: a divided state is much easier to conquer.
1.4 FROM FAILURES TO FIXES
Let's put our knowledge about states and dependences into practice, following the TRAFFIC steps (List 1.1).
1.4.1 Track the Problem
The first step in debugging is to track the problem—that is, to file a problem report such that the defect will not go by unnoticed. In our case, we have already observed the failure symptom: the output of sample, when invoked with arguments 11 and 14,contains a zero.
$ ./sample 11 14 Output: 0 11 $ _
An actual problem report would include this invocation as an instruction on how to reproduce the problem (see Chapter 2).
1.4.2 Reproduce the Failure
In case of the sample program, reproducing the failure is easy. All you need do is reinvoke sample, as shown previously. In other cases, though, reproducing may require control over all possible input sources (techniques are described in Chapter 4).
1.4.3 Automate and Simplify the Test Case
If sample were a more complex program, we would have to think about how to automate the failure (in that we want to reproduce the failure automatically) and how to simplify its input such that we obtain a minimal test case. In the case of sample, though, this is not necessary (for more complex programs, Chapter 5 covers the details).
1.4.4 Find Possible Infection Origins
Where does the zero in the output come from? This is the fourth step in the TRAFFIC steps: We must find possible infection origins. To find possible origins, we need the actual C source code of sample, shown in Example 1.1. We quickly see that the program consists of two functions:shell_sort() (which implements the shell sort algorithm) and main, which realizes a simple test driver around shell_sort(). The main function:
* Allocates an array a (line 32).
* Copies the command-line arguments into a (lines 33–34).
* Sorts a by invoking shell_sort() (line 36).
* Prints the content of a (lines 38–41).
By matching the output to the appropriate code, we find that the 0 printed by sample is the value of the variable a, the first element of the array a. This is the infection we observe: at line 39 in sample.c, variable a is obviously zero.
Where does the zero in a come from? Working our way backward from line 40, we find in line 36 the call shell_sort(a, argc), where the array a is passed by reference. This function might well be the point at which a was assigned the infected value.
Excerpted from Why Programs Fail by Andreas Zeller Copyright © 2009 by Elsevier Inc. . Excerpted by permission of Morgan Kaufmann. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.