RELIABILITY THEORY AND PRACTICE
By Igor Bazovsky
Dover Publications, Inc.
Copyright © 2004 DOVER PUBLICATIONS, INC.
All rights reserved.
THE CONCEPT OF RELIABILITY
WHAT IS RELIABILITY? Dictionaries and encyclopedias have various interpretations of this word, ranking it in the category of abstract concepts, such as goodness, beauty, honesty. Abstract concepts may mean different things to different people and are most difficult to define. What appears beautiful to one man may appear awkward to another. Concepts which are difficult to define become even more difficult to measure.
However, in engineering and in mathematical statistics, reliability has an exact meaning. Not only can it be exactly defined, but it can also be calculated, objectively evaluated, measured, tested, and even designed into a piece of engineering equipment. Thus, for us engineers, reliability is far from an abstract concept. On the contrary, reliability is a very harsh reality. It ranks on the same level with the performance of an equipment and, quite often, it is even more important than performance.
When a jet engine is designed for a certain thrust or an electronic equipment for a certain gain, and it happens in operation that the thrust or the gain is less than that called for by the specification, such performance, although not the best, may under certain circumstances still be satisfactory, and the engine or the electronic equipment may turn out to be very reliable. On the other hand, an engine of a different design may supply the full thrust with ease, so that it complies with all the performance specifications, but it may suddenly break down in operation. Here is where reliability enters.
Equipment breakdown can become a nightmare for the engineer—whether he is engaged in design, manufacture, maintenance, or operation. It affects not only the engineer and the manufacturer; quite often the user of the equipment bears the heaviest consequences. The price of unreliability is very high. The cure is reliability engineering.
From the beginning of the industrial age reliability problems have had to be considered. A classical example is ball and roller bearings; extensive studies of their life characteristics have been made since the early days of railroad transportation. Another example of the reliability approach is the design of equipment for a certain life, which dates back fifty or one hundred years. At first, reliability was confined to mechanical equipment. However, with the advent of electrification considerable effort went into making the supply of electric power reliable. Parallel operation of generators, transformers, and transmission lines and the interlinking of high-voltage supply lines into nationwide and continental grid systems have served the main purpose of keeping the supply of electric power as reliable as possible. It is not an exaggeration to say that the supply of utility electric power is nowadays almost one hundred per cent reliable. But it was far from that in the first decades of this century, as many people will remember. Parallel operation, redundancy, and better equipment have solved what was once a real problem. With the advent of aircraft came the reliability problems connected with airborne equipment, which were more difficult to solve than reliability problems of stationary or land-transportation equipment. But remarkable progress was made even in this field, mainly because of the great aircraft designers of the last few decades and their ingenious, intuitive approach.
Reliability entered a new era with the advent of the electronic age, the age of jet aircraft flying at sonic and supersonic speeds, and the age of missiles and space vehicles. Whereas originally the reliability problem had been approached by using very high safety factors which tremendously added to the weight of the equipment, or by extensive use of redundancy which again added to the over-all weight, or by learning from the failures and breakdowns of previous designs when designing new equipment and systems of a similar configuration, these approaches suddenly became impractical for the new types of airborne and electronic equipment. The terrific pace of aircraft and missile development and the miracles of modern electronics were combined with an urgent call for drastic reduction of equipment weight and size to allow the thousands and tens of thousands of necessary components to be squeezed together in small volumes. Time was running out on those who had hoped to wait to learn from mistakes made on previous designs. The next design had to be radically different from the previous one because within a few very short years the technical sciences would have again made big strides forward. Very little use could be made of the experience gained from previous mistakes; there was neither time nor money left for redesigning—both had to be made available for new projects. This rapid progress has not yet come to an end. On the contrary, the pace is increasing and will continue to increase. Therefore, the reliability problem has become more and more severe from year to year. The intuitive approach and the redesign approach have had to make way for an entirely new approach to reliability—statistically defined, calculated, and designed.
Thus, the engineer who wants to keep pace with technical developments and the manufacturer who wants to remain in business must become familiar with the new concept of reliability, and they must apply the new reliability methods in their everyday work.
Stated simply, reliability is the capability of an equipment not to break down in operation. When an equipment works well, and works whenever called upon to do the job for which it was designed, such equipment is said to be reliable. Satisfactory performance without breakdowns while in use and readiness to perform at the desired time are the criteria of an equipment's reliability. The equipment may be a simple device, such as a switch, a diode, or a connection, or it may be a very complex machine, such as a computer, a radar, an aircraft, a missile, or any of their subsystems. The reliability of complex equipment depends on the reliability of its components. There exists a very exact mathematical relation between the parts' reliabilities and the complex-system reliability, as we shall soon learn.
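The relation itself is not given in this excerpt. As a minimal sketch, assuming the simplest case of a series arrangement (the system works only if every part works) and statistically independent part failures, the system reliability is the product of the part reliabilities. The function name and the sample figures below are illustrative assumptions, not taken from the text.

```python
import math


def series_reliability(part_reliabilities):
    """Reliability of a series system of independent parts.

    Assumption (not from the excerpt): the system fails as soon as any
    one part fails, and part failures are independent of each other.
    """
    for r in part_reliabilities:
        if not 0.0 <= r <= 1.0:
            raise ValueError("each reliability must lie between 0 and 1")
    # Product of the individual probabilities of survival.
    return math.prod(part_reliabilities)


# Hypothetical equipment: 100 parts, each with reliability 0.999.
system = series_reliability([0.999] * 100)
print(round(system, 4))
```

Under these assumptions, one hundred parts that are each 99.9 per cent reliable give a system that is only about 90.5 per cent reliable, which illustrates why complex equipment places such heavy demands on its components.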
The measure of an equipment's reliability is the frequency at which failures occur in time. If there are no failures, the equipment is one hundred per cent reliable; if the failure frequency is very low, the equipment's reliability is usually still acceptable; if the failure frequency is high, the equipment is unreliable.
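As a rough illustration of this measure, the observed failure frequency can be estimated by dividing the number of failures by the accumulated operating time. The function name and the sample figures are hypothetical; the sketch assumes that all failures and all operating hours were recorded over the same observation period.

```python
def observed_failure_rate(failures, operating_hours):
    """Estimate failure frequency as failures per operating hour.

    Hypothetical helper: both arguments are assumed to come from the
    same observation period of the same equipment.
    """
    if failures < 0:
        raise ValueError("failures cannot be negative")
    if operating_hours <= 0:
        raise ValueError("operating_hours must be positive")
    return failures / operating_hours


# Hypothetical record: 4 failures in 8000 hours of operation.
rate = observed_failure_rate(failures=4, operating_hours=8000.0)
print(rate)  # failures per operating hour
```

A result of zero corresponds to the one hundred per cent reliable case described above; whether a nonzero value is acceptable depends on the requirements placed on the equipment.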
A well-designed, well-engineered, thoroughly tested, and properly maintained equipment should never fail in operation. However, experience shows that even the best design, manufacturing, and maintenance efforts do not completely eliminate the occurrence of failures. Reliability distinguishes three characteristic types of failures (excluding damage caused by careless handling, storing, or improper operation by the users) which may be inherent in the equipment and occur without any fault on the part of the operator.
First, there are the failures which occur early in the life of a component. They are called early failures and in most cases result from poor manufacturing and quality-control techniques during the production process. A few substandard specimens in a lot of otherwise fine components can easily sneak through the manufacturing process. Or, during the assembly of an equipment a poor connection may go through unnoticed. Such errors are bound to cause trouble, and the failures which then inevitably occur take place usually during the first minutes or hours of operation. Early failures can be eliminated by the so-called "debugging" or "burn-in" process. The debugging process consists of operating an equipment for a number of hours under conditions simulating actual use; when weak, substandard components fail in these early hours of the equipment's operation, they are replaced by good components; when poor solder connections or other assembly faults show up, they are corrected. Only then is the equipment released for service. The burn-in process consists of operating a lot of components under simulated conditions for a number of hours, and then using the components which survive for the assembly of the equipment.
Secondly, there are failures which are caused by wearout of parts. These occur in an equipment only if it is not properly maintained—or not maintained at all. Wearout failures are a symptom of component aging. The age at which wearout occurs differs widely with components, ranging from minutes to years. In most cases wearout failures can be prevented. For instance, in repeatedly operated equipment one method is to replace at regular intervals the accessible parts which are known to be subject to wearout, and to make the replacement intervals shorter than the mean wearout life of the parts. Or, when the parts are inaccessible, they are designed for a longer life than the intended life of the equipment. This second method is also applied to so-called "one-shot" equipment, such as missiles, which are used only once during their lifetime.
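How much shorter than the mean wearout life the replacement interval should be depends on how the wearout lives are scattered. The sketch below assumes, purely for illustration, that wearout life follows a normal distribution with a known mean and standard deviation; the excerpt does not name a distribution, and the function name and figures are hypothetical. It returns the operating age at which the probability of wearout reaches a chosen limit.

```python
from statistics import NormalDist


def replacement_interval(mean_life, std_dev, max_wearout_probability):
    """Age by which only the given fraction of parts has worn out.

    Assumption (not from the excerpt): wearout life is normally
    distributed with the given mean and standard deviation.
    """
    if not 0.0 < max_wearout_probability < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    wearout_life = NormalDist(mu=mean_life, sigma=std_dev)
    # Inverse of the cumulative distribution: the age t with P(life <= t) = p.
    return wearout_life.inv_cdf(max_wearout_probability)


# Hypothetical part: mean wearout life 1000 h, standard deviation 100 h,
# and at most a 1 per cent chance of wearout before replacement.
interval = replacement_interval(1000.0, 100.0, 0.01)
print(round(interval))
```

With these figures the interval comes out at roughly 767 hours, well below the 1000-hour mean. A tighter limit on the wearout probability, or a wider scatter of lives, shortens the interval further.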
Thirdly, there are so-called "chance" failures which neither good debugging techniques nor the best maintenance practices can eliminate. These failures are caused by sudden stress accumulations beyond the design strength of the component. Chance failures occur at random intervals, irregularly and unexpectedly. No one can predict when chance failures will occur; however, they obey certain rules of collective behavior so that the frequency of their occurrence during sufficiently long periods is approximately constant. Chance failures are sometimes called "catastrophic" failures, which is inaccurate because early failures and wearout failures can be as catastrophic as chance failures, and chance failures are not necessarily "catastrophic" for the equipment in which they occur.
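If the frequency of chance failures is taken to be exactly constant, the usual mathematical model is the exponential law, in which the probability of surviving an operating period of length t is exp(-λt), where λ is the failure rate. The excerpt does not state this formula; the sketch below is offered under that assumption, with a hypothetical function name and sample values.

```python
import math


def chance_failure_reliability(failure_rate, hours):
    """Probability of no chance failure during `hours` of operation.

    Assumption (not from the excerpt): failures occur at a constant
    rate, so survival follows the exponential law exp(-rate * time).
    """
    if failure_rate < 0 or hours < 0:
        raise ValueError("failure_rate and hours must be non-negative")
    return math.exp(-failure_rate * hours)


# Hypothetical equipment: 0.001 failures per hour, 10-hour mission.
r = chance_failure_reliability(failure_rate=0.001, hours=10.0)
print(r)        # probability of completing the mission without failure
print(1.0 - r)  # probability of at least one failure
```

The result depends only on the product of failure rate and operating time, so halving the failure rate has the same effect as halving the length of the mission.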
It is not normally easy to eliminate chance failures. However, reliability techniques have been developed which can reduce the chance of their occurrence and therefore reduce their number to a minimum within a given time interval, or even completely eliminate equipment breakdowns resulting from component chance failures.
Reliability theory and practice differentiates between early, wearout, and chance failures for two main reasons. First, each of these types of failures follows a specific statistical distribution and therefore requires a different mathematical treatment. Secondly, different methods must be used for their elimination.
Because the failure-free operation of certain equipment is vital to the preservation of human lives, to defense, and to industry, such equipment must be highly reliable. Its early failures should be eliminated by thorough, prolonged testing and check-out before it is put into service. Wearout failures should be excluded by correctly scheduled, good preventive practices. Then, if failures still occur during the operational life of the equipment, they will almost certainly be chance failures. Therefore, when such equipment is in operational use, its performance reliability is determined by the frequency of the chance failure occurrence.
Reliability engineering is concerned with eliminating early failures by observing their distribution and determining accordingly the length of the necessary debugging period and the debugging methods to be followed. Further, it is concerned with preventing wearout failures by observing the statistical distribution of wearout and determining the overhaul or preventive replacement periods for the various parts or their design life. Finally, its main attention is focused on chance failures and their prevention, reduction, or complete elimination because it is the chance failure phenomenon which most undesirably affects equipment reliability in actual operational use—in the period after the equipment has been debugged and before parts begin to wear out. For long-life equipment this amounts to the period between overhauls.
A word of caution is important here. Unfortunately, all too often not enough pains are taken to eliminate early failures completely and to prevent wearout failures. Early failures can creep into an equipment every time it is overhauled or repaired, either by an improper selection of replacement components for those that have failed and those approaching a wearout condition, or by some faulty connection (such as a solder joint), or by some other adjustments in the system which are not made properly when repair action is taken. Such poor repair practices may introduce early failures into the equipment time and again throughout its operational life; the system or equipment can never become reliable even though, with good repair practices and considering only chance failures, it might be a very reliable piece of engineering work. In a similar way wearout failures can make an inherently very reliable equipment extremely unreliable. However, such unreliability is mostly caused by negligence (for example, by not following the maintenance rules). The equipment is then not at fault.
Reliability is a yardstick of the capability of an equipment to operate without failures when put into service. Reliability predicts mathematically the equipment's behavior under expected operating conditions. More specifically, reliability expresses in numbers the chance that an equipment will operate without failure for a given length of time in the environment for which it was designed.
It is known from mathematical statistics that exact formulas exist for the frequency of occurrence of events following various kinds of statistical distributions, and from these the chance or probability of the occurrence of these events can be derived. In reliability we are concerned with events which occur in the time domain. For instance, wearout failures usually cluster around the mean wearout life of components. Once their distribution is known, the probability of wearout failure occurrence at any operating age of the component can be mathematically calculated. Similar considerations apply to early failures and to chance failures. However, early and chance failures do not cluster around any mean life but occur at random intervals. They therefore belong in the category of random events or stochastic processes and have their own characteristic distribution which is different from wearout failures. Although the time of occurrence of failures which occur at random time intervals cannot be predicted, the probability of the occurrence or nonoccurrence of such failures in an operating interval of a given length can be calculated by means of the theory of probability.
By its most primitive definition, reliability is the probability that no failure will occur in a given time interval of operation. This time interval may be a single operation, such as a mission, or a number of consecutive operations or missions. The opposite of reliability is unreliability, which is defined as the probability of failure in the same time interval.
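In symbols, this definition can be sketched as follows. The notation is introduced here for illustration and does not appear in the excerpt: T stands for the time at which the first failure occurs, t for the length of the operating interval, R for reliability, and Q for unreliability.

```latex
R(t) = P(T > t), \qquad
Q(t) = P(T \le t) = 1 - R(t), \qquad
R(t) + Q(t) = 1
```

Because failure and no failure are the only two possible outcomes in the interval, the two probabilities always add up to one.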
Excerpted from RELIABILITY THEORY AND PRACTICE by Igor Bazovsky. Copyright © 2004 DOVER PUBLICATIONS, INC. Excerpted by permission of Dover Publications, Inc.