Read an Excerpt
  Why Programs Fail 
 A Guide to Systematic Debugging 
 By Andreas Zeller 
 Morgan Kaufmann 
 Copyright © 2009   Elsevier Inc. 
All right reserved.  ISBN: 978-0-08-092300-0 
    Chapter One 
  How Failures Come to Be  
  
  Your program fails. How can this be? The answer is that the programmer creates  a defect in the code. When the code is executed, the defect causes an infection  in the program state, which later becomes visible as a failure. To find the  defect, one must reason backward, starting with the failure. This chapter defines  the essential concepts when talking about debugging, and hints at the techniques  discussed subsequently—hopefully whetting your appetite for the remainder of this  book.  
  
  1.1 MY PROGRAM DOES NOT WORK!  
  Oops! Your program fails. Now what? This is a common situation that interrupts our  routine and requires immediate attention. Because the program mostly worked until  now, we assume that something external has crept into our machine—something  that is natural and unavoidable;something we are not responsible for—namely,a bug.  
  If you are a user, you have probably already learned to live with bugs. You may  even think that bugs are unavoidable when it comes to software. As a programmer,  though, you know that bugs do not creep out of mother nature into our programs.  (See Bug Story 1 for an exception.) Rather, bugs are inherent parts of the programs  we produce. At the beginning of any bug story stands a human who produces the  program in question.  
  The following is a small program I once produced. The sample program is a very  simple sorting tool. Given a list of numbers as command-line arguments, sample  prints them as a sorted list on the standard output ($ is the command-line prompt).  
  $ ./sample 9 7 8  Output: 7 8 9  $ _  
  Unfortunately, sample does not always work properly, as demonstrated by the  following failure:  
  $ ./sample 11 14  Output: 0 11  $ _  
  Although the sample output is sorted and contains the right number of items,  some original arguments are missing and replaced by bogus numbers. Here, 14 is  missing and replaced by 0. (Actual bogus numbers and behavior on your system  may vary.) From the sample failure, we can deduce that sample has a bug (or, more  precisely, a defect). This brings us to the key question of this chapter:  
  
  1.2 FROM DEFECTS TO FAILURES  
  In general, a failure such as that in the sample program comes about in the four  stages discussed in the following.  
  1. The programmer creates a defect. A defect is a piece of the code that can cause an  infection. Because the defect is part of the code, and because every code is initially  written by a programmer, the defect is technically created by the programmer. If  the programmer creates a defect, does that mean the programmer was at fault?  Not necessarily. Consider the following:  
  * The original requirements did not foresee future changes. Think about the Y2K  problem, for instance.  
  * A program behavior may become classified as a "failure" only when the user  sees it for the first time.  
  * In a modular program, a failure may happen because of incompatible interfaces  of two modules.  
  * In a distributed program, a failure may be the result of some unpredictable  interaction of several components.  
  In such settings, deciding on who is to blame is a political, not a technical, question.  Nobody made a mistake, and yet a failure occurred. (See Bug Story 2 for  more on such failures.)  
  2. The defect causes an infection. The program is executed, and with it the defect.  The defect now creates an infection—that is, after execution of the defect, the  program state differs from what the programmer intended.  
  A defect in the code does not necessarily cause an infection. The defective  code must be executed, and it must be executed under such conditions that the  infection actually occurs.  
  3. The infection propagates. Most functions result in errors when fed with erroneous  input. As the remaining program execution accesses the state, it generates  further infections that can spread into later program states. An infection need not,  however, propagate continuously. It may be overwritten, masked, or corrected  by some later program action.  
  4. The infection causes a failure. A failure is an externally observable error in  the program behavior. It is caused by an infection in the program state.  
  The program execution process is sketched in Figure 1.1. Each program state  consists of the values of the program variables, as well as the current execution  position (formally, the program counter). Each state determines subsequent states,  up to the final state (at the bottom in the figure),in which we can observe the failure  (indicated by the x).  
  Not every defect results in an infection, and not every infection results in a failure.  Thus, having no failures does not imply having no defects. This is the curse of testing,  as pointed out by Dijkstra. Testing can only show the presence of defects, but never  their absence.  
  In the case of sample, though, we have actually experienced a failure. In hindsight, every  failure is thus caused by some infection, and every infection is caused by  some earlier infection, originating at the defect. This cause–effect chain from defect  to failure is called an infection chain.  
  The issue of debugging is thus to identify the infection chain, to find its root  cause (the defect),and to remove the defect such that the failure no longer occurs.  This is what we shall do with the sample program.  
  
  1.3 LOST IN TIME AND SPACE  
  In general, debugging of a program such as sample can be decomposed into seven  steps (List 1.1), of which the initial letters form the word TRAFFIC.  
  1. Track the problem in the database.  
  2. Reproduce the failure.  
  3. Automate and simplify the test case.  
  4. Find possible infection origins.  
  5. Focus on the most likely origins.  
  6. Isolate the infection chain.  
  7. Correct the defect.  
  
  Of these steps, tracking the problem in a problem database is mere bookkeeping  (see also Chapter 2) and reproducing the problem is not that difficult for deterministic  programs such as sample. It can be difficult for nondeterministic programs  and long-running programs, though, which is why Chapter 4 discusses the issues  involved in reproducing failures.  
  Automating the test case is also rather straightforward, and results in automatic  simplification (see also Chapter 5). The last step, correcting the defect, is  usually simple once you have understood how the defect causes the failure (see  Chapter 15).  
  The final three steps—from finding the infection origins to isolating the infection  chain—are the steps concerned with understanding how the failure came to be.  This task requires by far the most time, as well as other resources. Understanding  how the failure came to be is what the rest of this section, and the other chapters  of this book, are about.  
  Why is understanding the failure so difficult? Considering Figure 1.1,all one need  do to find the defect is isolate the transition from a sane state (i.e., noninfected, as  intended) to an infected state. This is a search in space (as we have to find out which  part of the state is infected) as well as in time (as we have to find out when the  infection takes place).  
  However, examination of space and time are enormous tasks for even the simplest  programs. Each state consists of dozens, thousands, or even millions of variables.  For example, Figure 1.2 shows a visualization of the program state of the GNU  compiler (GCC) while compiling a program. The program state consists of about  44,000 individual variables, each with a distinct value, and about 42,000 references  between variables. (Chapter 14 discusses how to obtain and use such program states  in debugging.)  
  Not only is a single state quite large, a program execution consists of thousands,  millions, or even billions of such states. Space and time thus form a wide area in  which only two points are well known (Figure 1.3): initially, the entire state is sane  (
 ), and eventually some part of the state is infected (x). Within the area spanned  by space and time, the aim of debugging is to locate the defect—a single transition  from sane ( ) to infected ( ) that eventually causes the failure (Figure 1.4).    Thinking about the dimensions of space and time, this may seem like searching  for a needle in an endless row of haystacks—and indeed, the fact is that debugging  is largely a search problem. This search is driven by the following two major  principles:  
  1. Separate sane from infected. If a state is infected, it may be part of the infection  propagating from defect to failure. If a state is sane, there is no infection  to propagate.  
  2. Separate relevant from irrelevant. A variable value is the result of a limited  number of earlier variable values. Thus, only some part of the earlier state may  be relevant to the failure.  
  
  Figure 1.5 illustrates this latter technique. The failure, to reiterate, can only have  been caused by a small number of other variables in earlier states (denoted using  the exclamation point, !), the values of which in turn can only have come from  other earlier variables. One says that subsequent, variable values depend on earlier  values. This results in a series of dependences from the failure back to earlier variable  values. To locate the defect, it suffices to examine these values only—as other values  could not have possibly caused the failure—and separate these values into sane and  infected. If we find an infected value, we must find and fix the defect that causes it.  Typically, this is the same defect that causes the original failure.  
  Why is it that a variable value can be caused only by a small number of earlier  variables? Good programming style dictates division of the state into units such that  the information flow between these units is minimized. Typically, your programming  language provides a means of structuring the state, just as it helps you to structure  the program code. However, whether you divide the state into functions, modules,  objects, packages, or components, the principle is the same: a divided state is much  easier to conquer.  
  
  1.4 FROM FAILURES TO FIXES  
  Let's put our knowledge about states and dependences into practice, following the  TRAFFIC steps (List 1.1).  
  
  1.4.1 Track the Problem  
  The first step in debugging is to track the problem—that is, to file a problem report  such that the defect will not go by unnoticed. In our case, we have already observed  the failure symptom: the output of sample, when invoked with arguments 11 and  14,contains a zero.  
  $ ./sample 11 14  Output: 0 11  $ _  
  An actual problem report would include this invocation as an instruction on how  to reproduce the problem (see Chapter 2).  
  
  1.4.2 Reproduce the Failure  
  In case of the sample program, reproducing the failure is easy. All you need  do is reinvoke sample, as shown previously. In other cases, though, reproducing  may require control over all possible input sources (techniques are described in  Chapter 4).  
  
  1.4.3 Automate and Simplify the Test Case  
  If sample were a more complex program, we would have to think about how to  automate the failure (in that we want to reproduce the failure automatically) and  how to simplify its input such that we obtain a minimal test case. In the case of  sample, though, this is not necessary (for more complex programs, Chapter 5 covers  the details).  
  
  1.4.4 Find Possible Infection Origins  
  Where does the zero in the output come from? This is the fourth step in the TRAFFIC  steps: We must find possible infection origins. To find possible origins, we need the  actual C source code of sample, shown in Example 1.1. We quickly see that the  program consists of two functions:shell_sort() (which implements the shell sort  algorithm) and main, which realizes a simple test driver around shell_sort(). The  main function:  
  * Allocates an array a (line 32).  
  * Copies the command-line arguments into a (lines 33–34).  
  * Sorts a by invoking shell_sort() (line 36).  
  * Prints the content of a (lines 38–41).  
  
  By matching the output to the appropriate code, we find that the 0 printed  by sample is the value of the variable a[0], the first element of the array a.  This is the infection we observe: at line 39 in sample.c, variable a[0] is obviously  zero.  
  Where does the zero in a[0] come from? Working our way backward from line 40,  we find in line 36 the call shell_sort(a, argc), where the array a is passed  by reference. This function might well be the point at which a[0] was assigned the  infected value.  
  (Continues...)  
  
     
 
 Excerpted from Why Programs Fail by Andreas Zeller  Copyright © 2009   by Elsevier Inc. .   Excerpted by permission of Morgan Kaufmann. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.