All content in this area was uploaded by David Valis on Mar 31, 2015
D. Valis and L. M. Bartlett
*Corresponding Authors Email: L.M.Bartlett@lboro.ac.uk
The Failure Phenomenon
D. VALIS (a) and L. M. BARTLETT (b)*
(a) University of Defence, Brno, Czech Republic
(b) Loughborough University, Loughborough, United Kingdom
Abstract: In everyday life many events are encountered whose causes, mechanisms of
development and consequences are very diverse. In a safety or risk assessment it is the
description of these events that is often of importance. In purely technical applications
these events relate to the occurrence of failure, be it of equipment, a device, a system
or an item. The theory addresses failure itself, its mechanisms and the circumstances of
its occurrence, but appropriate terminology is also needed to describe these conditions.
For observing and handling failures either a probabilistic or a deterministic (logic)
approach can be followed. Information about a failure may need to be found in, or
transferred from, a variety of sources. This paper considers the complex, sometimes
problematic, area of the term “failure” and its related characteristics, and aims to
detail the full complexity of this fundamental term. The functions of an object and their
description, the classification of failures, the main characteristics of failure, the
possible causes, mechanisms and consequences of failure, and other closely related topics
are all investigated and discussed. The paper also deals with possible information
sources on failure. In conclusion, the paper serves to form a complete picture to aid the
understanding of failures and their implications.
Keywords: Risk and Dependability Terminology, Failure.
1. Introduction
It is inevitable that at some point a failure (of equipment, a device, an item or a
system) will occur. The reasons for this occurrence can vary. Usually the main factor is
that the applied load exceeds the dimensioning/robustness of the product. The load can be
purely mechanical (e.g. force, tension), purely electrical (e.g. power, electromagnetic
field), purely chemical (such as the effect of chemical substances), generally physical
(e.g. heat, radiation), a combination of these or of a totally different nature. Whenever
the applied load exceeds the assumed dimensioning of the item, unwanted (usually
irreversible) processes start, and sooner or later a failure occurs. The load can be a
one-time load or it can be applied a number of times. In the first case an overload
failure will occur and in the second case a fatigue failure will occur. Unless a failure
occurs immediately, the product can become weaker over time for any one of many reasons.
In dealing with a failure one of the basic assumptions is that the device must be in
operation before any failure due to an inner cause (e.g. operation or use of the item) is
incurred. Idleness of an item or a system can also end in a failure due to natural
ageing, but in this case the initiating mechanism is not properly understood. A relevant
failure mostly occurs only during operation.
In describing failures there are many factors and characteristics to consider, one of
which is the failure profile. This profile depends on the failure causes, the failure
manifestations (namely the ways and mechanisms of failure) and the failure consequences.
The failure causes can be design failures, manufacturing failures, overstress failures,
misuse failures or degradation failures. The failure manifestations may be random,
gradual, sudden, common cause, primary, secondary, intrinsic or extrinsic. The
consequences could be insignificant, marginal, minor, major, critical or catastrophic, or
they may be scaled differently.
Failure is a term widely used in technical practice, especially in dependability
theory. For reliability practitioners failure is a basic term, and it is key to observing
the stochastic relations of item behavior. It is an event in the sense used in
probability theory, where on a general level the term probability event is used.
In dependability theory it is necessary to treat failure as a stochastic term, to
understand its meaning, and to understand its links to other terms. For this reason the
mathematical tools used in dependability are not merely a dead and boring “set” of
formulas, relations and graphical expressions.
While observing a technical item, the concern is with the possible causes of failures,
their development over time, their process and mechanism, and of course their impact,
effect, or other influences which might result from a failure occurrence. It is essential
to realize that a failure is of key importance for the operation and function of
technical items. Theory, and practice in particular, show that failures occur in
different situations and under various circumstances and conditions. When dealing with
failures theoretically, it is possible to describe their causes, nature of occurrence and
process of development, and to model these failures at the same time. It is possible to
see connections between individual groups of failures and their profiles. A range of
importance and numerical values associated with the failures can be determined. However,
the fundamental desire is to eliminate failure occurrence, reduce its frequency, or limit
the number of occurrences over a specified time period or in relation to another observed
quantity (mileage, cycles, etc.). The ultimate intention is to be able to determine
failure occurrence exactly; put simply, the aim is to get a better profile of an observed
item from the view of its dependability and related properties.
Furthermore, it is necessary to be able to describe the possible classes of failures,
their profiles, development, consequences, and other relations which might be important
for dependability theory and especially for this paper. The phenomena covered in this
paper are by no means a complete and synoptic list of all known and possible events
accompanying a failure. The aim of this paper is to introduce a topic which is usually
believed to be obvious, familiar and clear, although reality need not match these ideas.
The purpose of the paper is also to introduce the reader to the topic of failure and at
the same time to popularize it.
1.1. Notation
ISO International Organization for Standardization
IEC International Electrotechnical Commission
R(t) Reliability Function
t Time
PC Personal computer
MIL-STD Military Standard
MIL HDBK Military Handbook
RAC Reliability Analysis Center
RCM Reliability Centred Maintenance
EPRD Electronic Parts Reliability Data
NPRD Non-electronic Parts Reliability Data
FMD Failure mode/mechanism distribution
SPIDR System and Part Integrated Data Resource
FMECA Failure Modes, Effects and Criticality Analysis
PHA Preliminary Hazard Analysis
JSA Job Safety Analysis
OSHA Operating and Support Hazard Analysis
2. Current Terminology
Thanks to ISO/IEC representatives, industrial experts and national bodies there is an
established set of terminology related to failure. According to the present version of
IEC 60050 (191), failure is defined as: “termination of the ability of an item to perform
a required function”. There are three things to note: (1) after failure the item has a
fault; (2) failure is an event, as distinguished from fault, which is a state; and (3)
this concept as defined does not apply to items consisting of software only.
According to the newly updated version of IEC 60050 (191), failure is defined as: “loss
of ability to perform as required”. It is noted that: (1) When the loss of ability is caused
by a pre-existing condition, the failure occurs when a particular set of circumstances is
encountered; (2) A failure of an item is an event, as distinct from a fault of an item,
which is a state; and (3) Qualifiers may be used to classify failures according to the
severity of consequences, such as catastrophic, critical, major, minor, marginal and
insignificant, the definitions depending upon the field of application.
It follows from these definitions and further analysis that the term “failure” will be
understood as an event which leads directly to either a partial or complete loss of the
ability of an item to fulfill a required function. The terminology related to this topic
is currently being modified and updated, and hence the existing understanding of these
concepts (failure and related facts) has been changing. The following facts are
introduced to demonstrate the complexity of the present state.
According to the notes on the term failure in IEC 60050 (191)/1990, “An item after
failure has a fault”. Owing to continuing discussions about this topic it is impossible
to ignore the idea that a fault does not follow a failure but precedes it. This technical
incompatibility, together with many others, has not been resolved yet, although its form
has been much discussed. A possible decision in favour of the new view will radically
influence the existing approach to, conception of and observation of failure.
While working with the term failure, as well as with related states, it is necessary to
take the current terminology mismatch into account and to adapt possible decisions to it.
The possibility of such a change has to be accepted along with all the resulting
consequences. Unfortunately, this change will disrupt the understanding of all existing
terms and disciplines introduced so far that deal with failure and dependability.
3. Influence of the failure
When considering a failure it is necessary to draw attention to some related events. As
the term failure relates to the loss of an item’s ability to perform a required function,
it is clear that this inability of a system or a product to operate in a required way is
a key term determining a failure.
Based on many studies and approaches, a practical scheme for describing the individual
functions of a system has been formed. On this basis it is also essential to distinguish
the influence of a failure on a function performed by an item. A failure occurrence might
affect the range of the function. An outline of item functions is provided to make
understanding easier, though it should be remembered that failure occurrence is not
strictly limited to the type of item function.
A required function specifies an item’s task. A correct, exact and unequivocal
definition of it is the primary starting point for all dependability definitions as well
as for a correct failure definition. Operating conditions significantly affect both
dependability and, in particular, possible failure occurrence, which is why they have to
be determined very thoroughly. The types of function are defined in Table 1.
Table 1: Function definitions
Function type | Detail
Main function | An intended (required) or primary function
Minor function | Needed for providing the main function
Supporting function | Provides protection of people and the environment from potential damage resulting from main or minor function failure, as well as common support (brakes, circuit breakers, filters, etc.)
Information function | Provides conditions, monitoring, measuring, diagnostics, etc. (it refers to displays, indicators, etc.)
Interface function | Provides an interface between an assessed item and other items (cabling, operating elements, switches, breakers, etc.)
The required function and/or operating conditions might be time dependent. In this
case a mission profile has to be determined and all dependability viewpoints have to be
related to it. A representative mission profile and corresponding dependability targets
have to be stated in the item’s specification. The mission duration is usually
considered as a parameter t, that is, time. The dependability function, in particular the
reliability function, is designated R(t). R(t) is the probability that no failure at item
level will occur in the interval (0, t], often with the assumption R(0) = 1, meaning that
at time t = 0 the item was in the state of operation. In order to avoid confusion, a
distinction should be made between predicted dependability and dependability estimated
(assessed) on the basis of a real evaluation during operation or tests. The predicted
dependability is calculated on the basis of the item’s dependability structure and the
failure rate. The estimated dependability is determined on the basis of a statistical
evaluation of dependability tests or field data under known operating and environmental
conditions.
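The difference between the two quantities can be illustrated with a minimal Python sketch. It assumes, purely for illustration, a series structure of elements with constant failure rates (an exponential model) for the prediction, and complete observation of all items up to time t for the estimate; neither assumption is prescribed by the text, and the function names and numerical values are hypothetical.

```python
import math


def predicted_reliability(failure_rates, t):
    """Predicted R(t) for a series structure of elements with constant
    failure rates (assumed exponential model): exp(-sum(rates) * t)."""
    return math.exp(-sum(failure_rates) * t)


def estimated_reliability(failure_times, n_items, t):
    """Estimated R(t) from test or field data: the fraction of n_items
    with no failure in the interval (0, t]. Items without a recorded
    failure time are counted as survivors (all items observed up to t)."""
    failed = sum(1 for ft in failure_times if ft <= t)
    return (n_items - failed) / n_items


# Hypothetical inputs: two elements in series, failure rates per hour.
rates = [2e-5, 3e-5]
print(predicted_reliability(rates, 0))      # 1.0, i.e. R(0) = 1
print(predicted_reliability(rates, 1000))   # exp(-0.05), roughly 0.951

# Hypothetical field data: 10 items observed, 3 failure times in hours.
observed = [400.0, 900.0, 1500.0]
print(estimated_reliability(observed, 10, 1000))  # 0.8
```

The prediction uses only the assumed structure and failure rates, whereas the estimate uses only observed failures, which is why the two values generally differ.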
However simple the failure definition, “it occurs when an item terminates its ability to
perform its required function”, might look, it is difficult to apply to complex
items/systems. The basic operating time is generally a random variable. It is often
reasonably long but it might also be very short, for example because of a systematic
failure or an early failure resulting from a transient event at turn-on. A general
presumption in investigating failure-free operating times is that at the instant t = 0
the item is free of defects and systematic failures and is therefore fully able to
operate. Besides their relative frequency, failures can be categorized in a variety of
ways, namely by mode, cause, consequence, etc. The basic factors of each of the main
categories (critical stage, failure cause, failure mode, range of consequence, place of
occurrence, occurrence mechanism, and verification) are summarized in Table 2.
Table 2: Failure Profile categories
Category | Factors
Critical stage | Consequence seriousness
Failure cause | Misuse failure; mishandling failure; weakness failure; design failure; manufacturing failure; ageing/wearout failure; others (e.g. software)
Failure mode (velocity) | Sudden or gradual degradation
Range of a consequence | Cataleptic; complete; partial; other
Place of occurrence | During a test, during operation
Occurrence mechanism | Primary, secondary, systematic/reproducible
Verification possibility | Verified, unverified
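As an illustration only, the categories of Table 2 can be captured in a small record type so that observed failures are described consistently. The following Python sketch is an assumption about how such a record might look; the class and field names (FailureProfile, Cause and so on) are hypothetical and are not taken from any standard.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Cause(Enum):
    MISUSE = "misuse"
    MISHANDLING = "mishandling"
    WEAKNESS = "weakness"
    DESIGN = "design"
    MANUFACTURING = "manufacturing"
    AGEING_WEAROUT = "ageing/wearout"
    OTHER = "other (e.g. software)"


class Mode(Enum):
    SUDDEN = "sudden"
    GRADUAL = "gradual"


class Extent(Enum):
    CATALEPTIC = "cataleptic"
    COMPLETE = "complete"
    PARTIAL = "partial"
    OTHER = "other"


class Place(Enum):
    TEST = "during a test"
    OPERATION = "during operation"


class Mechanism(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"
    SYSTEMATIC = "systematic/reproducible"


@dataclass(frozen=True)
class FailureProfile:
    """One observed failure described by the categories of Table 2."""
    seriousness: str      # critical stage; the scale depends on the application
    cause: Cause
    mode: Mode
    extent: Extent
    place: Place
    mechanism: Mechanism
    verified: bool


def count_by_cause(profiles):
    """Return how many recorded failures fall under each cause."""
    return Counter(p.cause for p in profiles)


# Hypothetical records, for illustration only.
records = [
    FailureProfile("major", Cause.DESIGN, Mode.SUDDEN, Extent.COMPLETE,
                   Place.OPERATION, Mechanism.PRIMARY, True),
    FailureProfile("minor", Cause.AGEING_WEAROUT, Mode.GRADUAL, Extent.PARTIAL,
                   Place.OPERATION, Mechanism.PRIMARY, True),
    FailureProfile("minor", Cause.DESIGN, Mode.GRADUAL, Extent.PARTIAL,
                   Place.TEST, Mechanism.SECONDARY, False),
]
print(count_by_cause(records)[Cause.DESIGN])  # 2
```

Using enumerations rather than free text keeps each record within the factors listed in the table; the seriousness field is left as free text because its scale depends on the field of application.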
4. Failure Occurrence Cause
According to IEC 60050 (191), the cause of a failure is the set of circumstances during
design, manufacture or use which have led to a failure. To decide how to prevent a
failure or its recurrence it is necessary to know its cause. Failure causes can be
classified in relation to the life cycle of the system. The cause of a failure can be
intrinsic, due to weaknesses in the item and/or wearout, or extrinsic, due to errors,
misuse or mishandling during design, production and especially use. Extrinsic causes
often lead to systematic failures, which are deterministic and might be considered like
defects (dynamic defects in software quality). Defects are present at t = 0, even if they
cannot be discovered at t = 0. Failures always appear in time, even if the time to
failure is very short, as it can be with systematic or early failures.
These causes can be further explained in terms of:
1) Design failure
2) Weakness failure
3) Manufacturing failure
4) Ageing failure
5) Misuse failure
6) Mishandling failure
7) Software failure
These causes are shown diagrammatically in Figure 1. Design failure occurs due to
inadequate design; it is any failure directly related to the item design, meaning that
due to the design a part of the whole degraded or was damaged, and this resulted in a
failure of the whole. Weakness failure occurs due to a weakness (internal) inherent or
induced in the system, so that the system cannot withstand the stress it encounters in
its normal environment. Manufacturing failure is caused by nonconformity during
manufacturing and processing; it is any failure caused by faulty processing, inadequate
manufacturing, or an error made while controlling the process during manufacturing, tests
and repairs. An ageing failure is caused by the effects of usage and/or age. A misuse
failure is caused by misuse of the system (operating it in environments for which it was
not designed). A mishandling failure is caused by incorrect handling and/or lack of care
and maintenance. A software failure is caused by an error made by a PC programmer.
Fig 1: Failure cause classification
The failure mechanism is a very complex and extensive part of the failure profile. It
can be sudden or gradual, with related manifestations. The failure mechanism may be a
physical, chemical, electrical, thermal or other process that results in failure. The
mode (manifestation, course) of a failure is the symptom (local effect) by which a
failure is observed, for example opens, shorts, or drifts (for electronic components), or
brittle rupture, creep, cracking, seizure, or fatigue (for mechanical components).
Fig 2: Failure mechanism description
The general relations are shown in Figure 2. The connections related to these aspects
of a failure are shown in Table 3.
Table 3: Failure Mechanism Breakdown
Failure mechanism | Description
Intermittent (incoherent) failure | A failure which lasts only for a short time. A good example of this is a fault that occurs only under certain conditions occurring intermittently (irregularly).
Extended failure | A failure that persists until some corrective action rectifies it. Extended failures can be divided into two categories: sudden or gradual.
Sudden failure | A failure which occurs without warning.
Gradual failure | A failure which occurs with signals to warn of the occurrence. Usually it is a case of significant behavior changes (decreasing performance, increasing temperature, rising vibrations, etc.).
There is a need to distinguish among the different failure mechanisms of mechanical,
electrical and hydraulic parts. The differentiation is so complex that it cannot easily
be presented in this paper.
5. Failure Consequences
Many information sources use the term failure consequence, and many standards define
and work with it differently. This section aims to clarify the concept of failure
consequences from a reliability perspective. The effect (consequence) of a failure can be
different if considered on the item itself or at a higher level. A usual qualitative
classification of a failure is: non-relevant, partial, complete, …, critical. Since a
failure can also cause further failures in an item or a system, a distinction between
primary and secondary failure is important.
The severity of a failure mode is classified into four main categories, which are listed
in Table 4 in accordance with MIL-STD 882:
Table 4: Severity of a failure mode
Severity level | Description
Catastrophic failure | A failure that can lead to death or can cause total system (item) loss.
Critical failure | A failure which results in many serious injuries or major system damage. Sometimes it is regarded as a failure, or combination of failures, that prevents an item from performing a required mission.
Marginal failure | A failure that leads to minor injury or minor system damage.
Negligible failure | A failure that leads to less than minor injury or system damage.
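The four levels of Table 4 form an ordered scale, which can be sketched in Python as an ordered enumeration. The integer values below are an assumed ranking chosen only so that levels can be compared (higher means more severe); they are not codes from MIL-STD 882, and the failure modes in the example are hypothetical.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Assumed ordering for comparison only (higher = more severe).
    NEGLIGIBLE = 1
    MARGINAL = 2
    CRITICAL = 3
    CATASTROPHIC = 4


def worst_severity(severities):
    """Return the most severe level among the given failure modes."""
    return max(severities)


# Hypothetical failure modes of one item, for illustration only.
modes = {
    "indicator lamp dark": Severity.NEGLIGIBLE,
    "pump output reduced": Severity.MARGINAL,
    "brake line rupture": Severity.CRITICAL,
}
print(worst_severity(modes.values()).name)  # CRITICAL
```

Taking the maximum over the failure modes of an item is one simple way to assign the item an overall severity level.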
Another classification can be found in the RCM approach where the following classes are
used:
Failures with safety consequences;
Failures with environmental consequences;
Failures with operational consequences;
Failures with non-operational consequences.
A classification of failure severity into groups (categories) is given in further
standards, each of which is specific in some way and corresponds to a presupposed
application. IEC 61882, IEC 60812, IEC 50 126 and many others are examples. It is not
the ambition here to make a complete list of failure consequences and their classification.
6. Sources of Failure Profile Information
There are various sources available from which failure measures and their characteristics
can be obtained. The main sources are:
1) Data on elements’ reliability guaranteed by a producer.
2) Conclusive results of tests (observation) of the reliability of the same (or a
comparable) item. These are based on the standardized assessment of reliability tests of
technical items. The methods and methodologies for conducting tests are standardized for
different equipment.
3) Predictions: standardised calculation of an item’s reliability based on a reliable
source (MIL HDBK 217F). This is the American military handbook that enables the
reliability of electronic elements to be estimated. It is commonly used when estimating
elements’ failure rates, especially in military applications.
4) Specialized information databases on elements’ reliability (specialized in terms of
elements’ profile or conditions of usage). Such databases are usually established and
kept to meet the needs of single industrial branches or technical areas. The data
acquired when observing items in operation or the results of specialized dependability
tests are collected in the databases. One of the most respected and frequently used
sources in this area is the Reliability Analysis Center (RAC), which at present
distributes the following important databases on a commercial basis: EPRD-97, NPRD-95,
FMD-97 and SPIDR 2007.
5) General information databases on elements’ reliability. These databases are usually
published as parts of specialized literature in the dependability area. The information
in them is usually very general.
6) Expert estimations. Expert estimations of numerical values of reliability measures
might be used only when appropriate values cannot be specified by a different, more
reliable method. The authors of the article know from experience that this solution is
accepted only as an exception because in most cases the numerical values of reliability
measures can be determined by other methods described in this paper.
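To make source 3) more concrete, a prediction of this kind can be sketched as a parts-count style sum: each part type contributes its quantity times a generic failure rate times a quality factor, and the contributions are added to give the item failure rate. The Python sketch below assumes this simple additive form (a series structure with constant failure rates); the function name and all numerical values are hypothetical and are not taken from MIL HDBK 217F.

```python
def parts_count_failure_rate(parts):
    """Sum quantity * generic failure rate * quality factor over part types.

    parts: iterable of (quantity, generic_rate, quality_factor) tuples,
    with generic_rate in failures per 10^6 hours.
    """
    return sum(n * lam_g * pi_q for n, lam_g, pi_q in parts)


# Hypothetical bill of materials; the numbers are made up.
parts = [
    (10, 0.002, 1.0),   # e.g. resistors
    (4, 0.010, 2.0),    # e.g. capacitors of a lower quality grade
    (1, 0.050, 1.0),    # e.g. one integrated circuit
]

rate = parts_count_failure_rate(parts)   # failures per 10^6 hours
mean_time_to_failure = 1e6 / rate        # hours, valid for a constant rate
print(round(rate, 3))                    # 0.15
print(round(mean_time_to_failure))       # 6666667
```

In practice the generic failure rates and quality factors would come from the chosen data source (sources 1 to 5 above), with expert estimation used only as a last resort.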
7. Conclusion
This contribution gives a general overview of the supposedly basic term “failure”. It
can be seen from the material presented that there are many facets to consider when
dealing with this term. As the understanding of all related matters is very complex, it
is not possible to provide an exhaustive compendium of knowledge and experience; however,
the key issues are explained. Some reliability and safety engineers might be confused
when beginning a specific analysis (e.g. FMECA, PHA, JSA, OSHA, etc.). The main benefit
of this contribution is to serve as general introductory material for understanding
failure and its full profile with all related characteristics. The paper provides
information to direct the analyst to the appropriate information source needed for the
analysis. It is not possible to cover all of the extensive material, but the most
important outcome of the paper is a fundamental guideline, from a broad perspective, for
basic orientation in reliability and safety engineering matters.
Acknowledgements
Thanks are due to the anonymous referees who helped improve this article. This paper
has been prepared with the support of the Grant Agency of the Czech Republic, project No.
101/08/P020 “Contribution to risk assessment of technical systems”, and with the support
of the Ministry of Education, Youth and Sports of the Czech Republic, project
No. 1M06047 “Centre for Quality and Dependability of Production”.
References
[1] W. R. Blischke, “Reliability: Modelling, Prediction, and Optimisation”, John
Wiley, 2000, New York.
[2] E. A. Elsayed, “Reliability Engineering”, Addison-Wesley, 1996, New York.
[3] W. Q. Meeker and L. A. Escobar, “Statistical Methods for Reliability Data”, John
Wiley, 1998, New York.
[4] M. Modarres, M. Kaminskiy, V. Krivtsov, “Reliability Engineering and Risk
Analysis. A Practical Guide”, Marcel Dekker, 1999, New York.
[5] EPRD-97 Electronic Part Reliability Data. IIT Research Institute – Reliability
Analysis Center. Rome, New York. 1999.
[6] NPRD-95 Non-electronic Part Reliability Data. IIT Research Institute – Reliability
Analysis Center. Rome, New York. 1999.
[7] FMD-97 Failure Mode/Mechanism Distributions. IIT Research Institute –
Reliability Analysis Center. Rome, New York. 1999.
[8] SPIDR 2007 System and Part Integrated Data Resource. Alion Science and
Technology and System Reliability Center.