Article

A Measurement-Based Model for Workload Dependence of CPU Errors

Authors: R. K. Iyer and D. J. Rossetti

Abstract

This paper proposes and validates a methodology to measure explicitly the increase in the risk of a processor error with increasing workload. By relating the occurrence of a CPU related error to the system activity just prior to the occurrence of an error, the approach measures the dynamic CPU workload/failure relationship. The measurements show that the probability of a CPU related error (the load hazard) increases nonlinearly with increasing workload; i.e., the CPU rapidly deteriorates as end points are reached. The load hazard is observed to be most sensitive to system CPU utilization, the I/O rate, and the interrupt rates. The results are significant because they indicate that it may not be useful to push a system close to its performance limits (the previously accepted operating goal) since what we gain in slightly improved performance is more than offset by the degradation in reliability. Importantly, they also indicate that conventional reliability models need to be reevaluated so as to take system workload explicitly into account.


... In contrast, the effectiveness of upset-hardening techniques is dramatically reduced in submicron CMOS, where it also has a major impact on speed, power dissipation and integration density. On the other hand, experimental studies ([22] [23]) show that transient faults account for more than 80% of the faults of integrated systems. Most causes of transient faults generated by internal or external sources (noise, electromagnetic coupling, etc.) can be eliminated, except for the effects of radiation. ...
... Since they pass the manufacturing tests with higher "hidden" defect escape rates, these combined effects may lead to a significant increase of field failure rates: transient faults, performance degradation and catastrophic failures. Transient faults, which account for more than 90% of field failures in earlier technologies, are mainly caused by critical timing, noise and package- or environment-induced particle radiation [22][23][24]. These failure modes show a significant increase at higher integration densities, faster operating speeds and lower supply voltages, with potentially destructive effects in system-on-chip designs. ...
... Experimental studies [22] [23] have shown that more than 80 percent of system failures are transient in nature. Transient faults (TF) are temporary logic state flips of data sensitive nodes, generated by internal or external sources (i.e. ...
Article
High performance ICs manufactured in deep submicron CMOS show reduced operating margins for timing, power and noise, and increased device sensitivity to contamination, size variations and cosmic ray effects. As a consequence, radiation-induced soft errors and soft failures due to small manufacturing defects that escape voltage-mode testing represent a chief concern in deep submicron CMOS. This thesis describes design and test techniques for high reliability and fault tolerance to cope with soft failures and soft errors in both commercial and safety-critical system applications. To improve the IDDQ test effectiveness in detecting soft failures, we developed highly sensitive Built-In Current (BIC) sensor designs operating at high speed and low supply voltage. Optimized IDDQ test algorithms with embedded current monitors are proposed, and synergetic effects with low power design techniques are explored. On-chip IDDQ monitoring techniques are subsequently extended to on-line testing in safety-critical CMOS system applications. An upset-tolerant static RAM design is described that uses current monitoring and parity coding for error detection and correction. Radiation test results on a prototype circuit validate this approach. In order to avoid soft error occurrence in deep submicron CMOS applications, upset-immune design techniques using technology-independent local redundancy are described and analyzed. They are validated on memory and register array prototypes using commercial 1.2, 0.8 and 0.25 µm CMOS processes. On-chip test techniques are implemented for redundancy assessment of fault-tolerant CMOS architectures. Upset mechanisms in SEU-hardened CMOS storage elements are detected and analyzed using a focused pulse laser equipment, and specific design rules are devised for topology-related hardening. An upset-hardened sequential cell library has been designed in 0.6 µm CMOS to be employed in an ASIC modem chip for an onboard satellite experiment.
... The load-hazard model in (Iyer and Rossetti 1986) is adapted, which provides an analysis of the impact of load on the probability of the machine running without any CPU or system errors. They propose a load-dependent hazard or failure rate model z(x), where x is the gateway load L_{Gw_j}. ...
... Based on the data from (Iyer and Rossetti 1986), the hazard function z(L_{Gw_j}) is approximated by a third-degree polynomial, Eq. (4.12), where x is the CPU usage between 0.08 and 0.96; the minimum value is 0.0018 and the maximum value is 0.0118: z(x) = 0.0195x^3 − 0.137x^2 + 0.0059x + 0.0015 (4.12) ...
Thesis
Full-text available
The extension of the Cloud to the Edge of the network through Fog Computing can have a significant impact on the reliability and the latency of deployed applications. Recent papers have suggested a shift from Virtual Machines and Container based deployments to a shared environment among applications to better utilise resources. The existing deployment and optimisation methods do not account for application interdependence or the locality of application resources, which can cause inaccurate estimations. When considering models that account for these however, the optimisation task of allocating applications to gateways becomes a difficult problem to solve that requires either model simplifications or tailor-made optimisation methods. The main contribution of this research is the set of weighted deployments methods that aim to reduce the complexity of the search-space in large-scale fog deployment environments while retaining significant system characteristics. This work was attained by first addressing some existing IoT issues by proposing a Fog of Things gateway platform to answer the connectivity and protocol translation requirements. The proposed platform was used to identify the characteristics and challenges of these systems. A new data-driven reference model was then proposed to estimate the effects of application deployment and migration on these systems. Based on this model, weighted clustering and resource allocation methods are defined, that are then improved upon by a set of weight tuning methods focusing on analysing favourable and sample deployments. These proposals were validated by running tests on Industry 4.0 case studies. These varying scenarios made it possible to identify the scaling and deployment characteristics of these systems. Based on these initial tests, the second batch of physical and virtual experiments was carried out to validate the models and to evaluate the proposed methods. The findings show that the proposed application and gateway model can predict the load and delay of components to an accuracy of 91%. Within the presented scenarios, constraints and Fog sizes larger than 300 applications, the proposed weighted clustering methods were shown to significantly improve the utility of deployments. In some cases, these methods were the sole providers of valid solutions
... We adapt the load-hazard model in [39], which provides an analysis of the impact of load on the probability of the machine running without any CPU or system errors. They propose a load-dependent hazard or failure rate model z(x), where x is the gateway load L_{Gw_j}. ...
... Based on the data from [39], the hazard function z(L_{Gw_j}) is approximated by a third-degree polynomial (13), where x is the CPU usage between 0.08 and 0.96; the minimum value is 0.0018 and the maximum value is 0.0118: z(x) = 0.0195x^3 − 0.137x^2 + 0.0059x + 0.0015 (13). Considering a constant runtime of a day, or 24 hours, we can define our gateway reliability R_{Gw_j} using the load-based reliability function R(L_{Gw_j}) in (14). This gives us a maximum reliability of 95.5% and a minimum reliability of 75.35%. ...
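A quick numeric check of this load-hazard/reliability mapping is sketched below in Python. Two caveats: the reliability figures assume the relation R(L) = exp(−24·z(L)), which is how the quoted 95.5% and 75.35% are reproduced, and the quadratic coefficient is taken as 0.0137 rather than the printed 0.137, since only the former yields the quoted extremes of roughly 0.0018 and 0.0118 over x in [0.08, 0.96]. This is an illustrative sketch, not the authors' code.

```python
import math

def load_hazard(x: float) -> float:
    """Cubic load-hazard fit z(x) for CPU utilisation x in [0.08, 0.96].
    Assumption: quadratic coefficient 0.0137 (the printed 0.137 does not
    reproduce the quoted minimum 0.0018 and maximum 0.0118)."""
    return 0.0195 * x**3 - 0.0137 * x**2 + 0.0059 * x + 0.0015

def gateway_reliability(load: float, hours: float = 24.0) -> float:
    """Load-based reliability over a constant runtime, assuming R(L) = exp(-z(L) * t)."""
    return math.exp(-load_hazard(load) * hours)

if __name__ == "__main__":
    print(round(load_hazard(0.08), 4), round(load_hazard(0.96), 4))  # 0.0019 0.0118
    print(round(gateway_reliability(0.08), 4))  # 0.9556 -> quoted maximum reliability 95.5%
    print(round(gateway_reliability(0.96), 4))  # 0.7535 -> quoted minimum reliability 75.35%
```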
Article
Full-text available
The extension of the Cloud to the Edge of the network through Fog Computing can have a significant impact on the reliability and latencies of deployed applications. Recent papers have suggested a shift from VM and Container based deployments to a shared environment among applications to better utilize resources. Unfortunately, the existing deployment and optimization methods pay little attention to developing and identifying complete models of such systems, which may cause large inaccuracies between simulated and physical run-time parameters. Existing models do not account for application interdependence or the locality of application resources, which causes extra communication and processing delays. This paper addresses these issues by carrying out experiments in both cloud and edge systems with various scales and applications. It analyses the outcomes to derive a new reference model with data-driven parameter formulations and representations to help understand the effect of migration on these systems. As a result, we can have a more complete characterization of the fog environment. This, together with tailored optimization methods that can handle the heterogeneity and scale of the fog, can improve the overall system run-time parameters and improve constraint satisfaction. An Industry 4.0 based case study with different scenarios was used to analyze and validate the effectiveness of the proposed model. Tests were deployed on physical and virtual environments with different scales. The advantages of the model-based optimization methods were validated in real physical environments. Based on these tests, we have found that our model is 90% accurate on load and delay predictions for application deployments in both cloud and edge.
... During working hours the utilization/activity of the systems is higher than during non-working hours, which increases the occurrence of failures. The same conclusion has been drawn in [14] [22]. On the basis of this conclusion, a linear failure rate (failure rate and hazard rate are used interchangeably in this work) model directly proportional to the utilization has been proposed as follows ...
... This policy has been proposed in Function 2 to optimize the energy consumption by the VMs in the presence of failures. In this policy all the resources will be sorted in increasing order according to their power consumption corresponding to the current utilization (equation 14), so that the VM with maximum utilization will be allocated to a node with minimum power consumption: R_sorted ← P_j.sortPowerIncreasing() ...
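As a rough illustration of this energy-aware sorting step, the sketch below (hypothetical names and a linear power model, not the authors' code) orders nodes by their power consumption at the current utilization and places the most heavily utilized VM on the node that currently draws the least power, subject to a simple capacity check.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    idle_power: float    # watts drawn at 0% utilisation (assumed linear power model)
    max_power: float     # watts drawn at 100% utilisation
    utilisation: float   # current CPU utilisation in [0, 1]

    def power(self) -> float:
        """Power consumption corresponding to the current utilisation."""
        return self.idle_power + (self.max_power - self.idle_power) * self.utilisation

@dataclass
class VM:
    name: str
    utilisation: float   # CPU demand in [0, 1]

def energy_aware_allocate(vms: list[VM], nodes: list[Node]) -> dict[str, str]:
    """Sort VMs by utilisation (descending) and, for each one, re-sort nodes by
    power at their current utilisation (ascending); place the VM on the first
    node with enough spare capacity."""
    placement: dict[str, str] = {}
    for vm in sorted(vms, key=lambda v: v.utilisation, reverse=True):
        for node in sorted(nodes, key=Node.power):
            if node.utilisation + vm.utilisation <= 1.0:
                placement[vm.name] = node.name
                node.utilisation += vm.utilisation
                break
    return placement

nodes = [Node("n1", 70.0, 250.0, 0.6), Node("n2", 60.0, 210.0, 0.2)]
vms = [VM("vm1", 0.3), VM("vm2", 0.1)]
print(energy_aware_allocate(vms, nodes))  # {'vm1': 'n2', 'vm2': 'n2'}
```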
Conference Paper
Full-text available
Reliability and energy efficiency present one of the biggest trade-off challenges confronting cloud service providers. This paper provides a mathematical model of both reliability and energy consumption in cloud computing systems and analyses their interplay. This paper also proposes a formal method to calculate the finishing time of tasks running in a failure-prone cloud computing environment with and without checkpointing. To achieve the objective of maximizing the reliability and minimizing the energy consumption of cloud computing systems, three resource provisioning and virtual machine (VM) allocation policies using the aforementioned mathematical models are proposed. These three policies are named Reliability Aware Best Fit Decreasing (RABFD), Energy Aware Best Fit Decreasing (EABFD), and Reliability-Energy Aware Best Fit Decreasing (REABFD). A simulation-based evaluation of the proposed policies has been done by using real failure traces and workload models. The results of our experiments demonstrated that by considering both reliability and energy factors during resource provisioning and VM allocation, the reliability and energy consumption of the system can be improved by 23% and 61%, respectively.
... In most circumstances, an increased load induces a higher component failure rate [2]. Many empirical studies of mechanical systems [3] and computer systems [4][5] have proved that the workload strongly affects the component failure rate. Applications of load-sharing systems include electric generators sharing an electrical load in a power plant, CPUs in a multiprocessor computer system, cables in a suspension bridge, and valves or pumps in a hydraulic system [6]. ...
... In this chapter, we focus on the reliability analysis of load-sharing systems with a limited number of components. These models have a wide range of applications in reliability engineering [1][2][3][4][5][6][58][59][60][61]. Most approximations and asymptotic behaviors that are applicable for systems with a large number of elements (fiber bundles) are not applicable when the number of components (elements) in the system is very small. ...
Book
Full-text available
This is an edited book by Krishna B. Misra comprising 76 chapters and 1316 pages contributed by 100 well-known researchers of the world. There are 13 chapters contributed by the editor himself. Performability engineering provides us with the framework to consider both dependability and sustainability for the optimal design of products, systems or services. Dependability is an aggregate of one or more of the attributes of survivability (such as quality, reliability, maintainability, etc.) and safety, and present designs based on dependability and life cycle costs cannot really be called truly optimal, since these attributes are strongly influenced by the design, raw materials, fabrication techniques and manufacturing processes employed, and their control and usage. Therefore, sustainability, characterized by dematerialization, energy and waste minimization, disposability, reuse and recycling, and other environmental considerations which help in clean production, must be considered along with dependability. Design of 21st Century products, systems and services must conform to performability designs, all the more so when world resources are on the decline and, to keep pace with a rising population, the increased volume of production is bound to affect the world’s environmental health further. As of now, dependability and cost effectiveness are primarily seen as instruments for conducting international trade in the free market environment and thereby deciding the economic prosperity of a nation. However, the internalization of the hidden costs of environment preservation will have to be accounted for, sooner or later, in order to be able to produce sustainable products and systems in the long run. These factors cannot be ignored any more and must not be considered in isolation of each other. The Handbook of Performability Engineering considers all aspects of performability engineering, providing a holistic view of the entire life cycle of activities of the product, along with the associated cost of environmental preservation at each stage, while maximizing the performance. Comments by Way Kuo, Editor-in-Chief, IEEE Transactions on Reliability, and President, City University of Hong Kong, formerly Dean of Engineering and University Distinguished Professor, University of Tennessee: “The editor of the present Handbook of Performability Engineering, Dr. Krishna B. Misra, a retired eminent professor of the Indian Institute of Technology, took to reliability nearly four decades ago and is a renowned scholar of reliability. Professor Misra was awarded a plaque by the IEEE Reliability Society, in 1995, “in recognition of his meritorious and outstanding contributions to Reliability Engineering and furthering of Reliability Engineering Education and Development in India”. Upon his retirement in 2005 from IIT Kharagpur, where he established the first ever Reliability Engineering Centre in India and the postgraduate course in Reliability Engineering in 1982, he launched the International Journal of Performability Engineering in 2005 and has since led the journal as its inaugural Editor-in-Chief. The timely publication of this handbook necessarily reflects the changing scenario of the 21st century’s holistic view of designing, producing and using products, systems or services which satisfy the performance requirements of a customer to the best possible extent.
Having reviewed the contents of this voluminous handbook, and its contributed chapters, I find it clearly covers the entire canvas of performability: quality, reliability, maintainability, safety and sustainability. The handbook addresses how today’s systems need to be not only dependable (implying survivability and safety) but also sustainable. Modern systems need to be addressed in a practical way instead of simply as a mathematical abstract, often bearing no physical meaning at all. In fact, performability engineering not only aims at producing products, systems and services that are dependable but also involves developing economically viable and safe processes of modern technologies, including clean production that entails minimal environmental pollution. Performability engineering extends the traditionally defined performance requirements to incorporate the modern notion of requiring optimal quantities of material and energy in order to yield safe and reliable products that can be disposed of without causing any adverse effects on the environment at the end of their life cycle. The chapters included in this handbook have undergone a thorough review and have been carefully devised. These chapters collectively address the issues related to performability engineering. I expect the handbook will create an interest in performability and will bring about the intended interaction between various players of performability engineering. I firmly believe this handbook will be widely used by the practicing engineers as well as serve as a guide to students and teachers, who have an interest in conducting research in the totality of performance requirements of the modern systems of practical use. I would also like to congratulate Dr. Misra once again for taking the bold initiative of editing this historical volume.”
Another comment by Hoang Pham, Professor and Chair, Department of Industrial & Systems Engineering, Rutgers University, USA: “This is an excellent handbook that covers comprehensive topics including engineering design, system reliability modeling, safety analysis and perspectives, design optimization, environmental risk analysis, engineering management, roadmap for sustainability, performance economical analysis, quality management and engineering, process control, six sigma, robust design, continuous improvements, load-sharing system analysis, repairable system reliability, multiple phase-mission system reliability and imperfect coverage, Markov and Semi-Markov system reliability analysis, field data analysis, multi-state system reliability analysis, optimization, accelerated life testing, fault trees, common cause analysis, human-system interaction analysis, safety control analysis, probabilistic risk assessment, risk analysis and management, maintenance, sustainability, performability, replacement policies, MEMS, medical device analysis, electro and mechanical reliability assessment, wireless communication network reliability, distributed system computing, fault-tolerant system reliability, software reliability, and reliability growth models. I am sure that many, if not all, practitioners and researchers in the areas of reliability, safety, maintainability and related fields, including beginning students who are majoring or thinking of entering in reliability/performability research, will find this Handbook useful in many ways: looking for methodologies, solutions, problems or research ideas.” Yet another comment by Dr. William Vesely, Manager, Risk Assessment, Office of Safety and Mission Assurance, NASA: “Performability Engineering has as its scope the evaluation of all aspects of system performance. This encompasses the evaluation of the reliability of the system, its costs, its sustainability, its quality, its safety, its risk, and all of its performance outputs. In covering this broad scope, the objective is to provide a unified framework for comparing and integrating all aspects of system performance. This provides the manager and decision-maker with a complete, consistent picture of the system. This is the promise and exciting prospect of Performability Engineering. The chapters included in this handbook are diverse and represent the vitality of the different aspects of Performability Engineering. There are management-oriented chapters on the roles of reliability, safety, quality assurance, risk management, and performance management in the realm of performability management. There are chapters providing overview and the state-of-the-art on basic approaches being used in various disciplines. There are original technical contributions describing new methods and tools. Finally, there are chapters focusing on design and operational applications. The reader therefore has a veritable garden from which to feast from this impressive collection of chapters in the handbook. In short, it is expected that this handbook will be found to be very useful by practicing engineers and researchers of the 21st Century in pursuing this challenging and relevant area for sustainable development.”
... In the literature, various studies have empirically shown that the component failure rate is strongly affected by its workload. 5,6 Therefore, it is important to consider the effect of load when analyzing the reliability of multistate systems. Examples of recent research considering the dependence of the component failure rate on its load can also be found in the literature, and they further justify the importance of considering the load when evaluating the reliability of multi-state systems. ...
... However, the failure rate of element e_j also increases with its increasing load. 5,6 Hence, increasing the load on element e_j reduces the reliability of element e_j, and thus reduces the reliability of SWS. As a result, the reliability of the SWS is not a monotonic function of the loads given to each ME. ...
Article
Many engineering systems are designed to support load with different amounts. The load carried by the system has a significant effect on the deterioration of the system elements. This article presents an algorithm for determining the optimal load given to each multi-state element in a linear sliding window system such that the reliability of the system can be maximized. The model considers both the effect of load on the failure rate and the relationship between load and performance for each element. A reliability evaluation algorithm based on universal generating function is suggested for the proposed load-dependent linear sliding window system. The optimal load distribution problem is solved using genetic algorithms. Numerical experiments are presented to illustrate the proposed methods.
... The increase of workload affects both CPU utilization and computation time [5], which further causes degradation of cloud service availability [6]. Iyer et al. [7] reported a correlation between ...
... A more short-term operational option is to control the equipment's deterioration process by adjusting the production rate. This applies, for instance, to wind turbine gearboxes and generators that deteriorate faster at higher speeds (Feng et al., 2013;Zhang et al., 2015a), conveyor belts that fail more often when used at higher rotational speeds (Nourelfath and Yalaoui, 2012), trucks that fail earlier when heavier loaded (Filus, 1987), large computer clusters that fail more often under higher workloads (Ang and Tham, 2007;Iyer and Rossetti, 1986), and cutting tools that wear faster at higher speeds (Dolinšek et al., 2001). In addition, the recent advent of the Internet of Things (IoT) allows to control production rates remotely and in real-time, thereby making it practically viable to exploit this relation between production and deterioration. ...
Thesis
Full-text available
Machines used in production facilities deteriorate as a result of load and stress caused by production, and thus they eventually require maintenance to stay in an operating condition. The deterioration rate of a machine typically depends on the production rate, implying that we can control its deterioration, and thereby the maintenance planning, by dynamically adjusting the production rate. The main contribution of this thesis is to integrate the two research fields of production planning and condition-based maintenance. We introduce and explore the benefits of condition-based production and maintenance policies that exploit the relation between production and deterioration to improve the trade-off between the opposing targets of having high production outputs and low maintenance costs. The obtained insights are not only interesting from a theoretical point of view, but are also practically relevant due to the ongoing developments in the fields of sensor equipment and the internet of things. These developments enable operators to remotely monitor the deterioration of equipment and to control its usage in real-time, thereby allowing to implement automated condition-based production policies.
... A common example of a transient fault is the induction of spurious values in memory cells, caused by charged particles (e.g., alpha particles) passing through them [Krishna, 2014]. In computer systems transient faults occur much more frequently than permanent faults do [Castillo et al., 1982; Iyer and Rossetti, 1986]. Generally, there are two major techniques to recover from transient faults: primary-backup execution [Al-Omari et al., 2004] and checkpointing [Punnekkat et al., 2001]. ...
... A more short-term operational option is to control the equipment's deterioration process by adjusting the production rate. This applies, for instance, to wind turbine gearboxes and generators that deteriorate faster at higher speeds (Feng et al. 2013, Zhang et al. 2015), conveyor belts that fail more often when used at higher rotational speeds (Nourelfath and Yalaoui 2012), trucks that fail earlier when heavier loaded (Filus 1987), large computer clusters that fail more often under higher workloads (Ang and Tham 2007, Iyer and Rossetti 1986), and cutting tools that wear faster at higher speeds (Dolinšek et al. 2001). In addition, the recent advent of the Internet of Things (IoT) allows to control production rates remotely and in real-time, thereby making it practically viable to exploit this relation between production and deterioration. ...
Article
Full-text available
Problem definition: Many production systems deteriorate over time as a result of load and stress caused by production. The deterioration rate of these systems typically depends on the production rate, implying that the equipment’s deterioration rate can be controlled by adjusting the production rate. We introduce the use of condition monitoring to dynamically adjust the production rate in order to minimize maintenance costs and maximize production revenues. We study a single-unit system for which the next maintenance action is scheduled upfront. Academic/Practical Relevance: Condition-based maintenance decisions are frequently seen in the literature. However, in many real-life systems, maintenance planning has limited flexibility and cannot be done last minute. As an alternative, we are the first to propose using condition information to optimize the production rate, which is a more flexible short-term decision. Methodology: We derive structural optimality results from the analysis of deterministic deterioration processes. A Markov decision process formulation of the problem is used to obtain numerical results for stochastic deterioration processes. Results: The structure of the optimal policy strongly depends on the (convex or concave) relation between the production rate and the corresponding deterioration rate. Condition-based production rate decisions result in significant cost savings (by up to 50%), achieved by better balancing the failure risk and production output. For several systems, a win-win scenario is observed, with both reduced failure risk and increased expected total production. Furthermore, condition-based production rates increase robustness and lead to more stable profits and production output. Managerial Implications: Using condition information to dynamically adjust production rates provides opportunities to improve the operational performance of systems with production-dependent deterioration.
... For most of the clusters, it is observed that during the working hours of a day, i.e., 7 am to 7 pm, the number of failures increases. A similar impact of time on the occurrence of failures is seen in other observations as well [33,34]. This represents the correlation between the occurrence of failures and the utilization/activity of the system: during working hours, the utilization/activity of the systems is higher than during non-working hours, which increases the failure rate of systems. ...
Article
VM consolidation is an important technique used in cloud computing systems to improve energy efficiency. It migrates running VMs from underutilized physical resources to other resources in order to reduce energy consumption. But in a cloud computing environment with failure-prone resources, focusing solely on energy efficiency has adverse effects. If the reliability factor of resources is ignored, then the running VMs may get consolidated onto unreliable physical resources. This will cause more failures and recreations of VMs, thus increasing the energy consumption. To solve this problem, this paper proposes a failure-aware VM consolidation mechanism, which takes the occurrence of failures and the hazard rate of physical resources into consideration before performing VM consolidation. We propose a failure prediction technique based on exponential smoothing to trigger two fault tolerance mechanisms (VM migration and VM checkpointing). A simulation-based evaluation of the proposed VM consolidation mechanism was conducted by using real failure traces. The results demonstrate that by using the combination of checkpointing and VM migration with the proposed failure-aware VM consolidation mechanism, the energy consumption of the cloud computing system is reduced by 34% and reliability is improved by 12%, while decreasing the occurrence of failures by 14%.
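The exponential-smoothing trigger described here can be illustrated with a minimal sketch. The smoothing factor, the decision threshold and the names below are illustrative assumptions, not the paper's actual parameters or code.

```python
def exp_smooth(series: list[float], alpha: float = 0.3) -> float:
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}.
    Returns the one-step-ahead forecast after consuming the whole series."""
    forecast = series[0]
    for x in series[1:]:
        forecast = alpha * x + (1 - alpha) * forecast
    return forecast

def choose_fault_tolerance(failure_history: list[float],
                           threshold: float = 2.0, alpha: float = 0.3) -> str:
    """Illustrative policy: if the predicted failure count for the next interval
    exceeds the threshold, proactively migrate the VMs off the host; otherwise
    rely on periodic checkpointing alone."""
    predicted = exp_smooth(failure_history, alpha)
    return "migrate" if predicted > threshold else "checkpoint"

history = [0, 1, 0, 2, 3, 4]             # failures observed per interval on a host
print(round(exp_smooth(history), 2))     # 2.2 (forecast for the next interval)
print(choose_fault_tolerance(history))   # migrate
```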
... Other common cases are servers in a distributed computer system and pumps in hydraulic systems [2]. With an increasing workload, the failure rate of each surviving component will increase, its reliability will decrease, and the overall system risk will increase [3]-[4]. ...
Article
Full-text available
In this work, we propose a Failure Mechanism Cumulative (FMC) model that considers the loading history of the load-sharing effect in a k-out-of-n system. Three types of failure mechanisms are considered: continuous degradation, compound point degradation, and sudden failure due to shock. By constructing a logic diagram with a Functional DEPendence (FDEP) gate, the load-sharing effect can be explained from the failure-mechanism point of view using a mechanism-acceleration (MACC) gate, which shows that when one component fails, the failure mechanisms of the other surviving components will be accelerated. By deriving the total damage equation and constructing a failure behavior model, the system reliability of a k-out-of-n system with different types of failure mechanisms was evaluated. A voltage stabilizing system that contains a 1-out-of-2 subsystem or a 2-out-of-3 subsystem was used to illustrate the practical applicability of the proposed approach. A combined Monte-Carlo and Binary Decision Diagram (BDD) method was used in the numerical simulation process.
... When it is captured by a clock edge, a soft error occurs. Otherwise, that pulse is called a transient fault [12]. In recent years, with advances in fabrication technology and growing transistor counts, processors are becoming increasingly vulnerable to transient faults [13]. Transient faults currently account for over 80% of faults in processor-based devices [14]. In a typical integrated circuit, memory arrays, latch elements, and combinational logic are the most sensitive parts and could be affected by soft errors and transient faults. ...
Article
Full-text available
As transistors become increasingly smaller and faster and noise margins become tighter, circuits and chips, especially microprocessors, tend to become more vulnerable to permanent and transient hardware faults. Most microprocessor designers focus on protecting memory elements, among other parts of microprocessors, against hardware faults through adding redundant error-correcting bits such as parity bits. However, the rate of soft errors in combinational parts of microprocessors is nowadays considered as important as in sequential parts such as memory elements. The reason is that advances in scaling technology have led to reduced electrical masking. This paper proposes and evaluates a logic-level fault-tolerant method based on parity for designing combinational circuits. Experimental results on a full adder circuit show that the proposed method makes the circuit fault-tolerant with less overhead in comparison with traditional methods. It will also be demonstrated that our proposed method enables the traditional TMR method to detect multiple faults in addition to single fault masking. Keywords: Soft Error, Transient Fault, Fault-Tolerance, Combinational Circuits, Full Adder.
... According to the load-sharing rule, the workload shared by each surviving component increases once a component fails. Many empirical studies of mechanical systems [5,6], computer systems [7], and battery systems [8] have proved that the workload strongly affects the component failure rate. For mechanical systems, the degradation processes due to common failure mechanisms (e.g., wear degradation, corrosion, fracture, fatigue, etc.) [9] are dependent on the workload. ...
Article
Full-text available
Many systems often experience multiple failures resulting from simultaneous exposure to degradation processes and random shocks. For a load-sharing system, the dependencies among the degradation processes, random shocks and component failures potentially cause the system to fail more easily, which poses new challenging issues to evaluate the reliability. A novel reliability model for load-sharing systems subject to dependent degradation processes and random shocks is proposed. The new model extends previous models for simple parallel systems by considering the characteristics and specific dependencies of load-sharing systems. In a load-sharing system, the workload and shock load shared by each surviving component will increase after a component failure, leading to the higher degradation rate, the more serious sudden degradation damage caused by random shocks, and the greater probability of hard failure. In the model, the analytical expression is utilized to calculate the complex reliability. The complexity of this calculation is caused by the stochastic failure time of surviving components, the stochastic arriving time of shocks, and their interaction. A case of the load-sharing redundant micro-engines in Micro-Electro-Mechanical System is presented to demonstrate the proposed model, and the result shows that the reliability of load-sharing system is lower than that of a simple parallel system.
... In many technical systems, elements can function at different load levels. Empirical studies confirm theoretical modeling showing that the operation workload of elements can affect their performance (productivity) and the time-to-failure distribution [1][2][3] . Thus, the overall system performance and reliability depend on the selected load levels for elements. ...
Article
This paper considers specific series systems exposed to external shocks and internal failures. The systems must complete a specified amount of work. The system elements can operate at different load levels that determine their performance and internal failure rates. Increasing the elements' loading, on the one hand, increases their internal failure rate but, on the other hand, increases their performance level, which leads to a reduction of the mission time. The optimal loading should achieve a balance between these two effects and provide the largest mission success probability. Two different types of series systems, characterized by different definitions of the time during which the elements are exposed to external shocks, are considered. The cases when the elements are exposed to a common shock process and to different independent shock processes are also analyzed. Illustrative examples are presented.
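The load trade-off described in this abstract can be made concrete with a toy one-element calculation: a higher load raises the internal failure rate but also shortens the mission, and the mission success probability peaks at an intermediate load. The linear performance curve, cubic hazard and work amount below are illustrative assumptions, not the paper's model.

```python
import math

WORK = 100.0  # amount of work the mission must complete (illustrative units)

def performance(load: float) -> float:
    """Assumed productivity, proportional to the load level (work units per hour)."""
    return 10.0 * load

def hazard(load: float) -> float:
    """Assumed internal failure rate, growing nonlinearly with load (per hour)."""
    return 0.002 + 0.02 * load**3

def mission_success_probability(load: float) -> float:
    """Exponential survival over the load-dependent mission time WORK / performance."""
    mission_time = WORK / performance(load)
    return math.exp(-hazard(load) * mission_time)

loads = [round(0.1 * k, 1) for k in range(1, 11)]
for l in loads:
    print(l, round(mission_success_probability(l), 3))
print("best load:", max(loads, key=mission_success_probability))  # 0.4 for these numbers
```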
... Electric generators sharing an electrical load in a power plant, CPUs in a multiprocessor computer system, cables in a suspension bridge and valves or pumps in a hydraulic system are examples of load-sharing systems (see Kuo and Zuo 2003). Many empirical research results have indicated that the workload strongly affects a component's failure rate and that an increased load induces a higher failure rate (see Kapur and Lamberson 1977, Iyer and Rossetti 1986, Barros et al. 2003). In a load-sharing system, if a component fails, the same workload has to be shared by the remaining components, resulting in an increased load shared by each surviving component. ...
Article
This paper proposes a new model that generalizes the traditional Markov repairable system to the case of spatial dependence among components. The components of the system are identical, arranged in two lines, and form a lattice. The performance of each component depends on its spatial "neighbours" and the number of failed components in other lines. A Markov process is adopted to model the performance of the system. The state space and transition rate matrix corresponding to a 6-component lattice load-sharing system with spatial dependence are presented. Availability of the system is obtained via Markov theory and the Laplace transform method. A numerical example is given to illustrate the results in this paper. The states of the system are partitioned into four state sets: security, degraded, warning, and failed. The probabilities of visiting the four state sets are also discussed in the numerical example. The work might provide a basis for the reliability analysis of load-sharing systems with interacting components that are themselves arranged in some two-dimensional spatial pattern.
... It is worth mentioning that the type 2 failure interaction between components is common in load-sharing systems [16][17][18][19]. In material testing, software reliability, population sampling and mechanical engineering, the load can strongly impact the component state [20][21][22][23] (failure rate, reliability, availability, damage level, etc.). This is because, in a load-sharing system, when a component fails, the static or time-varying workload [24] is taken over by the non-failed components. ...
Article
This paper presents a two-component load-sharing system. The failure rates of the two components are time dependent and load dependent. Whenever one component fails, it is imperfectly repaired with a time delay, during which the failure rate of the surviving component increases because of the resulting overload. Three maintenance policies are proposed considering imperfect preventive maintenance and system replacement. The optimal average costs in the long run under different maintenance policies are derived from the theoretical propositions. Sensitivity analyses through numerical examples are carried out.
... Each element can function at different load levels. It has been shown by empirical studies that the workload applied to a system element when operating can affect its performance (productivity) and time-to-failure distribution [17][18][19]. Consequently, the overall system performance can heavily depend on the selected element load levels. ...
Article
Many real-life systems have series parallel structures, where some components or subsystems work in parallel while some others have to work in series or consecutively. This paper models dynamic performance of multi-state series parallel systems with repairable elements that can function at different load levels. Performance (productivity) and time-to-failure distribution of an operating system element depend on its load level. An element, upon failure, can be repaired with repair time obeying a known distribution. The entire system must satisfy a random demand during a fixed mission time or must complete a fixed amount of work. A discrete numerical algorithm is first proposed for evaluating instantaneous availability of a system element with a particular load level, which further defines stochastic process of the element's performance. A universal generating function technique is then used for assessing system performance metrics including expected system performance, expected probability of meeting system demand, expected amount of unsupplied demand over a particular mission time and expected time needed to perform a given amount of work for the considered system. The proposed methodology is applicable to arbitrary types of failure time and repair time distributions. Another original contribution of this work is formulating and solving elements loading optimization problems, which choose elements load levels to achieve one of the following objectives: maximum system expected performance, maximum expected probability of meeting demand during a time horizon, minimum total unsupplied demand during the mission, or minimum completion time for a given amount of work. As demonstrated through a case study of a power station coal transportation system, optimization results can provide effective guidance on optimal operational load of multi-state series parallel system elements.
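The universal generating function (u-function) technique referenced here represents each element as a discrete performance-probability distribution and composes those distributions with operators matching the system structure, for example summing performances for parallel elements and taking the minimum along a series (flow-limited) chain. The sketch below is a generic illustration of that composition step under assumed element data; it is not the paper's algorithm and omits repair, load levels and time dependence.

```python
from collections import defaultdict
from typing import Callable, Dict

UFunc = Dict[float, float]  # performance level -> probability

def compose(u1: UFunc, u2: UFunc, op: Callable[[float, float], float]) -> UFunc:
    """Compose two u-functions with a structure operator (probabilities multiply)."""
    out: UFunc = defaultdict(float)
    for g1, p1 in u1.items():
        for g2, p2 in u2.items():
            out[op(g1, g2)] += p1 * p2
    return dict(out)

parallel = lambda a, b: a + b    # capacities of parallel elements add up
series = lambda a, b: min(a, b)  # a series chain is limited by its weakest element

# Assumed example: two parallel pumps feeding one pipe.
pump1 = {0.0: 0.05, 50.0: 0.95}
pump2 = {0.0: 0.10, 40.0: 0.20, 60.0: 0.70}
pipe = {0.0: 0.02, 80.0: 0.98}

system = compose(compose(pump1, pump2, parallel), pipe, series)
demand = 70.0
print(system)
print("P(performance >= demand):", round(sum(p for g, p in system.items() if g >= demand), 4))
# -> 0.8379 for these assumed element distributions
```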
... In a multicomponent system, the failure of one component can lead to a redistribution of the system loading (Yu et al. [1]). Furthermore, many empirical studies have indicated that the workload strongly affects a component's failure rate and that an increased load induces a higher failure rate [2][3][4][5]. Thus, how to describe the dependency among components in a system has been an interesting topic. ...
Article
Full-text available
Star repairable systems with spatial dependence consist of a center component and several peripheral components. The peripheral components are arranged around the center component, and the performance of each component depends on its spatial “neighbors.” Vector-Markov process is adapted to describe the performance of the system. The state space and transition rate matrix corresponding to the 6-component star Markov repairable system with spatial dependence are presented via probability analysis method. Several reliability indices, such as the availability, the probabilities of visiting the safety, the degradation, the alert, and the failed state sets, are obtained by Laplace transform method and a numerical example is provided to illustrate the results.
... It has been shown through empirical studies that the load level can significantly impact the failure behavior of a system element [1,2,31,32]. The accelerated failure-time model (AFTM) is widely used for describing the relationship between element load and failure behavior, where the effect of load is multiplicative in time [7,33]. ...
Article
The paper suggests a new fault coverage model for the case when the effectiveness of recovery mechanisms in a subsystem depends on the entire performance level of this subsystem. Examples of this effect can be found in computing systems, electrical power distribution networks, communication systems, etc. The paper presents a modification of the generalized reliability block diagram (RBD) method for evaluating reliability and performance indices of complex multistate series-parallel systems with performance-dependent fault coverage under the assumption that the system state cannot change during the task execution. The suggested method based on a universal generating function technique allows the system performance distribution to be obtained using a straightforward recursive procedure. Illustrative examples are presented.
... The performance of such systems depends on the amount of load they are carrying. Besides, many studies have empirically shown that the failure rate of a system element is strongly affected by its working load [12,13]. Therefore, it is important to consider the effect of loading when analyzing the availability of multi-state systems. As a result, several recent research works have studied the optimal loading of multi-state systems [14][15][16][17][18][19]. ...
... Selecting the appropriate fault model is a crucial decision. It is frequently estimated that 80% or more of all hardware faults that occur in ground-based computer applications are transient in nature [111]. The effects may last for milliseconds, or they may persist until the system is rebooted or refreshed. ...
Article
This paper shows how it is possible to construct PRA models of digital instrumentation and control systems using DFM and Markov/CCMT as two example dynamic methodologies. The digital feedwater control system of a PWR has been used as an example system to illustrate the process. The prime implicants and their probabilities generated by these two methodologies have been compared. The comparison shows a very close consistency between the DFM and Markov/CCMT results. The power of these dynamic methodologies is their ability to identify combinations of component failure modes, even across time boundaries, that can result in system failure modes that otherwise would be very difficult to identify with a standard ET/FT approach. Applications of either methodology require complete and thorough supporting analyses (e.g. FMEA) and data (e.g. transition and failure data for components), as well as a system model describing the system behavior under normal and upset conditions (e.g. a simulator).
... Selecting the appropriate fault model is a crucial decision. It is frequently estimated that 80% or more of all hardware faults that occur in ground-based computer applications are transient in nature [111]. The effects may last for milliseconds, or they may persist until the system is rebooted or refreshed. ...
Technical Report
Full-text available
As part of the U.S. Nuclear Regulatory Commission's (NRC's) effort to advance the state-of-the-art in digital I&C system risk and reliability analysis, the NRC Office of Nuclear Regulatory Research is sponsoring research into both traditional and dynamic methods for modeling. The results of a recent study reported in NUREG/CR-6901 indicate that the conventional event-tree (ET)/fault-tree (FT) methodology may not yield satisfactory results in the reliability modeling of digital I&C systems. Using subjective criteria based on reported experience, NUREG/CR-6901 has identified the dynamic flowgraph methodology (DFM) and the Markov methodology as the methodologies that rank as the top two with the most positive features and the fewest negative or uncertain features when evaluated against the requirements for the reliability modeling of digital I&C systems. NUREG/CR-6901 has also concluded that benchmark systems should be defined to allow assessment of the dynamic methodologies proposed for the reliability modeling of digital I&C systems using a common set of hardware/software/firmware states and state transition data. This report: a) defines such a benchmark system based on the steam generator feedwater control system of an operating pressurized water reactor (PWR), b) provides procedures to illustrate how dynamic reliability models for the benchmark system can be constructed using DFM and Markov methodologies, and c) illustrates how the resulting dynamic reliability models can be integrated into the probabilistic risk assessment (PRA) model of an existing PWR using SAPHIRE as an example ET/FT PRA tool. The report also discusses to what extent the DFM and the Markov methodology meet the requirements given in NUREG/CR-6901 for the reliability modeling of digital I&C systems. Some challenges are identified. It is concluded that it may be possible to meet most of these challenges by linking the existing ET/FT based plant PRA tools to dynamic methodologies through user-friendly interfaces and using distributed computing. The challenge that is the most difficult to address is the acceptability of the failure data used. While it is also concluded that the proposed methods can be used to obtain qualitative information on the failure characteristics of digital I&C systems as well as quantitative, and, in that respect, can be helpful in the identification of risk-important event sequences even if the data issue is not resolved, the report presents only a proof-of-concept study. Additional work is needed to validate the practicality of the proposed methods for other digital systems and to resolve the challenges identified.
... However, on the other hand, the impact of soft errors and other non-critical failures is increasing at a much higher rate [Con03]. Some studies have indicated that up to 80% of failures can be attributed to SEUs [IR86, DKCP94, KP09], which suggests that these cases will result in unnecessary yield loss. This section describes a few techniques to discriminate between the different error categories described earlier in the chapter. ...
Thesis
With the constant evolution and ever-increasing transistor densities in semiconductor technology, error rates are on the rise. Errors that occur on semiconductor chips can be attributed to permanent, transient or intermittent faults. Out of these errors, once permanent errors appear, they do not go away and once intermittent faults appear on chips, the probability that they will occur again is high, making these two types of faults critical. Transient faults occur very rarely, making them non-critical. Incorrect classification during manufacturing tests in case of critical faults, may result in failure of the chip during operational lifetime or decrease in product quality, whereas discarding chips with non-critical faults may result in unnecessary yield loss. Existing mechanisms to distinguish between the fault types are mostly rule-based, and as fault types start manifesting similarly as we move to lower technology nodes, these rules become obsolete over time. Hence, rules need to be updated every time the technology is changed. Machine learning approaches have shown that the uncertainty can be compensated with previous experience. In our case, the ambiguity of classification rules can be compensated by storing past classification decisions and learn from those for accurate classification. This thesis presents an effective solution to the problem of fault classification in VLSI chips using Support Vector Machine (SVM) based machine learning techniques.
... In the load-sharing system, the workload has to be shared by the remaining components, resulting in an increased load shared on each surviving component [5]. Many empirical studies of mechanical systems [6] and computer systems [7] have proved that the workload strongly affects the component failure rate. Scheuer [8] studied the reliability of the k-out-of-n system when component failure induces higher failure rates in survivors. ...
Article
Full-text available
The k-out-of-n configuration is a typical form of redundancy technique to improve system reliability, where at least k out of n components must work for successful operation of the system. When the components are degraded, more components are needed to meet the system requirement, which means that the value of k has to increase. The current reliability analysis methods overestimate the reliability, because using a constant k ignores the degradation effect. In a load-sharing system with degrading components, the workload shared on each surviving component will increase after a random component failure, resulting in a higher failure rate and an increased performance degradation rate. This paper proposes a method combining a tampered failure rate model with a performance degradation model to analyze the reliability of a load-sharing k-out-of-n system with degrading components. The proposed method considers the value of k as a variable which is derived by the performance degradation model. Also, the load-sharing effect is evaluated by the tampered failure rate model. A Monte-Carlo simulation procedure is used to estimate the discrete probability distribution of k. The case of a solar panel is studied in this paper, and the result shows that the reliability considering component degradation is less than that ignoring component degradation.
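A tampered failure rate model of this kind is straightforward to explore by simulation. The sketch below estimates the reliability of a load-sharing k-out-of-n system in which each surviving component fails exponentially at a rate proportional to its equally shared load; all parameter values are illustrative, and the degradation-driven change of k described in the abstract is not modelled.

```python
import random

def k_out_of_n_reliability(n: int = 4, k: int = 2, base_rate: float = 0.01,
                           total_load: float = 1.0, mission_time: float = 100.0,
                           runs: int = 20000, seed: int = 1) -> float:
    """Monte-Carlo estimate of mission reliability under a tampered failure rate:
    each surviving component carries total_load / alive and fails at
    base_rate * (its load); the system works while at least k components survive."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(runs):
        alive, t = n, 0.0
        while alive >= k:
            per_component_rate = base_rate * (total_load / alive)
            # time to the next failure among the 'alive' exponential components
            t += rng.expovariate(alive * per_component_rate)
            if t > mission_time:
                successes += 1
                break
            alive -= 1
    return successes / runs

print(k_out_of_n_reliability())  # ~0.92 with these illustrative parameters
```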
Article
This paper, for the first time, models and optimizes the uploading and downloading pace distribution in a production-dual storage system that must supply a certain demand during a specified mission time. The surplus product generated by the production unit is uploaded to available storage unit(s), which may be subsequently downloaded to supply the system demand in the event of the failure of the production unit. The uploading and downloading paces of the two storage units can greatly affect failure probabilities of the storage units and further the mission success probability (MSP). This paper makes advancement in the state of the art by suggesting a probabilistic approach for evaluating the MSP of the considered production-dual storage system. The optimal uploading and downloading pace distribution problem is then solved to maximize the MSP. Two case studies of water supply systems respectively with identical and dissimilar tanks are conducted to illustrate the proposed model and optimal uploading and downloading pace distribution solutions. Impacts of several system parameters (system demand, performance of the production unit, reliability and initial amount of product in storage units) and their interactions on the MSP and optimization solutions are also investigated through the case studies.
Article
With network function virtualization technology, a middlebox can be deployed as software on commercial servers rather than on dedicated physical servers. A backup server is necessary to ensure the normal operation of the middlebox. The workload can affect the failure rate of a backup server, yet the impact of a workload-dependent failure rate on backup server allocation, considering unavailability, has not been extensively studied. This paper proposes a shared backup allocation model for middleboxes that takes into account the workload-dependent failure rate of the backup server. Backup resources on a backup server can be assigned to multiple functions. We observe that a function has four possible states and analyze the state transitions within the system. Through a queuing approach, we compute the probability of each function being available or unavailable for a certain assignment, and obtain the unavailability of each function. The proposed model is designed to find an assignment that minimizes the maximum unavailability among functions. We develop a simulated annealing algorithm to solve this problem. We evaluate and compare the performances of the proposed and baseline models under different experimental conditions. Based on the results, we observe that, compared to the baseline model, the proposed model reduces the maximum unavailability by an average of 29% in our examined cases.
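A toy version of the assignment search described above is sketched below: functions are assigned to backup servers, each server's failure probability grows with its aggregate assigned workload, and simulated annealing searches for an assignment that minimizes the maximum per-function unavailability. The failure-probability curve, workloads, cooling schedule and names are all assumptions; the sketch omits the paper's queueing and state-transition analysis.

```python
import math
import random

def server_failure_prob(load: float) -> float:
    """Assumed workload-dependent failure probability of a backup server."""
    return min(0.5, 0.01 + 0.2 * load**2)

def max_unavailability(assignment: list[int], func_loads: list[float], n_servers: int) -> float:
    """Worst-case unavailability: each function inherits the failure probability
    of the backup server it is assigned to (a deliberately simplified model)."""
    server_load = [0.0] * n_servers
    for func, server in enumerate(assignment):
        server_load[server] += func_loads[func]
    return max(server_failure_prob(server_load[s]) for s in assignment)

def anneal(func_loads: list[float], n_servers: int, steps: int = 5000,
           t0: float = 0.05, seed: int = 0) -> tuple[list[int], float]:
    """Simulated annealing over assignments, minimizing the maximum unavailability."""
    rng = random.Random(seed)
    assignment = [rng.randrange(n_servers) for _ in func_loads]
    cost = max_unavailability(assignment, func_loads, n_servers)
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9
        candidate = assignment[:]
        candidate[rng.randrange(len(candidate))] = rng.randrange(n_servers)
        cand_cost = max_unavailability(candidate, func_loads, n_servers)
        if cand_cost < cost or rng.random() < math.exp((cost - cand_cost) / temp):
            assignment, cost = candidate, cand_cost
    return assignment, cost

loads = [0.2, 0.3, 0.1, 0.4, 0.25]
best_assignment, worst_unavailability = anneal(loads, n_servers=3)
print(best_assignment, round(worst_unavailability, 4))
```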
Article
The loading level applied during a system's operation can greatly affect the system's productivity and time-to-failure distribution. The optimal loading problem aims to determine loading levels leading to the best system performance. While this problem has been solved for many types of technical systems, very little work has been devoted to production systems (PSs) with storage, and the existing model assumed the storage is fully reliable and ignored the effects of the storage's loading levels. This paper contributes by solving the optimal loading problem for an imperfect PS having an unreliable storage, with the possibility to choose different load levels that determine the storage's uploading and downloading paces and load levels that determine the productivity of the PS. Moreover, the downloading load level is chosen dynamically depending on the beginning time of the storage downloading. We evaluate and minimize the expected cumulative unsupplied demand (EUD) during the mission, and perform a detailed case study of a water supply system to demonstrate the proposed model and the optimal loading policy solutions. The influences of several model parameters (mission time, PS's reliability, storage's reliability, capacity and initial amount) and their interactions on the EUD and the optimization solutions are also examined through examples.
Article
It is a common practice to use product storage to enhance system operation efficiency and mission success probability (MSP). However, very few studies in the reliability literature have considered the storage component, and none of the existing models has addressed the optimal loading problem. This paper contributes by analyzing and maximizing the MSP for repairable systems with load-dependent performance and product storage of limited capacity. The storage is used to accumulate surplus product when the system performance exceeds the demand and to compensate the deficiency when the system has insufficient performance or has failed and is under repair. Both time-to-failure and time-to-repair are random, following arbitrary distributions. A numerical MSP evaluation algorithm is put forward for the considered repairable system with specified demand and mission time. As another contribution, the optimal loading problem is solved to determine the loading policy that maximizes the MSP. A case study on a repairable pump system in a chemical reactor is provided to examine the influences of storage capacity and initial storage amount on the MSP and the optimized loading policy.
Article
Fault tolerance and load balancing are two key roles in resource allocation against failures. This paper proposes a primary and backup resource allocation model with preventive recovery priority setting to minimize a weighted value of unavailable probability (W-UP) against multiple failures. W-UP considers the probability of unsuccessful recovery and the maximum unavailable probability after recovery among physical nodes. We consider that each node fails with a workload-dependent failure probability and that each failure pattern occurs with a given probability. The workload-dependent failure probability is a non-decreasing function revealing an empirical relationship between the workload and the failure probability for each physical node. We introduce a recovery strategy to handle the workload variation, which is determined at the operation start time and can be applied to each failure pattern. Once a failure pattern occurs, the recoveries are operated according to the priority setting to promptly recover the functions hosted by failed nodes. We also discuss an approach to obtain the unsuccessful recovery probability by considering the maximum number of arbitrary recoverable functions by a set of available nodes without the priority setting. We formulate the optimization problem as a mixed integer linear programming (MILP) problem. We develop a heuristic algorithm to solve larger problems in a practical time. The developed heuristic algorithm is approximately 729 times faster than the MILP approach with a 1.6% performance penalty on W-UP. The numerical results show that the proposed model reduces W-UP compared with baselines.
Article
In this article, we discuss optimal loading of items with lifetimes described by failure models that are popular in reliability and statistics. The obtained results can be relevant, for example, for production/manufacturing systems. The expected productivity and the mission success probability are maximized with respect to the value of the load. It is shown that the optimal load for the considered settings is not necessarily the load that maximizes the production rate. The crucial function in our discussion is the production rate as a function of the load. It is shown that, depending on the model, the optimal load can be equal to, smaller or larger than the load that achieves the maximum of this function. The accelerated life and the proportional hazards failure models are considered, as well as the additive hazards model. Illustrative examples confirm our theoretical findings.
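A hedged numerical companion to the point above that the optimal load need not maximize the production rate: under an assumed proportional-hazards model, raising the load increases the production rate but also the failure rate, so the load maximizing the expected output over a mission can sit below the rate-maximizing load. The functional forms and constants are assumptions of this sketch.

# Illustrative only: expected output over a mission of length tau when the item
# works until failure (no repair) under an assumed proportional-hazards model.
import math

def expected_output(s, tau=10.0, lam0=0.05):
    c = s * (2.0 - s)                 # assumed production rate, peaks at s = 1
    lam = lam0 * math.exp(2.0 * s)    # assumed load-dependent failure rate
    # E[output] = c(s) * E[min(T, tau)] with T ~ Exp(lam)
    return c * (1.0 - math.exp(-lam * tau)) / lam

loads = [i / 100 for i in range(1, 200)]
best = max(loads, key=expected_output)
print("load maximizing production rate: 1.00")
print("load maximizing expected output: %.2f" % best)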
Article
This paper proposes an optimization model to derive a primary and backup resource allocation considering a workload-dependent failure probability to minimize the maximum expected unavailable time (MEUT). The workload-dependent failure probability is a non-decreasing function which reveals the relationship between the workload and the failure probability. The proposed model adopts hot backup and cold backup strategies to provide protection. The cold backup strategy is a protection strategy in which the requested loads of backup resources are not activated before failures occur, to reduce resource utilization at the cost of longer recovery time. The hot backup strategy is a protection strategy in which the backup resources are activated and synchronized with the primary resources to recover promptly, at the cost of higher workload. We formulate the optimization problem as a mixed integer linear programming (MILP) problem. We prove that MEUT of the proposed model is equal to the smaller value between the two MEUTs obtained by applying only hot backup and cold backup strategies with the same total requested load. A heuristic algorithm inspired by the water-filling algorithm is developed with the proved theorem. The numerical results show that the proposed model suppresses MEUT compared with the conventional model which does not consider the workload-dependent failure probability. The developed heuristic algorithm is approximately 10^5 times faster than the MILP approach with a 10^-2 performance penalty on MEUT.
Article
This paper proposes a multiple backup resource allocation model with a workload-dependent failure probability to minimize the maximum expected unavailable time (MEUT) under a protection priority policy. The workload-dependent failure probability is a non-decreasing function which reveals the relationship between the workload and the failure probability. The proposed model adopts hot backup and cold backup strategies to provide protection. For protection of each function with multiple backup resources, a suitable priority policy is required to determine the expected unavailable time. We analyze the superiority of the protection priority policy for multiple backup resources in the proposed model and provide theorems that clarify the influence of policies on MEUT. We formulate the optimization problem as a mixed integer linear programming (MILP) problem. We provide a lower bound on the optimal objective value of the proposed model. We prove that the decision version of the multiple resource allocation problem in the proposed model is NP-complete. A heuristic algorithm inspired by the water-filling algorithm is developed, along with an upper bound on the expected unavailable time it obtains. The numerical results show that the proposed model reduces MEUT compared to baselines, and that the priority policy adopted in the proposed model suppresses MEUT compared with other priority policies. The developed heuristic algorithm is approximately 10^6 times faster than the MILP approach with a 10^-4 performance penalty on MEUT.
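The water-filling idea referred to in the two abstracts above can be illustrated, in much simplified form, by a greedy rule that always places the next backup load on the node whose resulting workload-dependent failure probability stays smallest; the failure-probability function and the loads below are invented for illustration and do not reproduce the papers' MILP models.

# Simplified water-filling-style greedy assignment (illustrative, not the
# papers' formulation): place each backup load on the node that keeps the
# maximum workload-dependent failure probability as low as possible.
def fail_prob(workload):
    # assumed non-decreasing workload-dependent failure probability
    return min(1.0, 0.02 + 0.001 * workload ** 2)

def assign(backup_loads, n_nodes):
    workload = [0.0] * n_nodes
    placement = []
    for load in sorted(backup_loads, reverse=True):   # largest loads first
        node = min(range(n_nodes), key=lambda i: fail_prob(workload[i] + load))
        workload[node] += load
        placement.append((load, node))
    return placement, max(fail_prob(w) for w in workload)

placement, worst = assign([4.0, 3.0, 3.0, 2.0, 1.0], n_nodes=3)
print(placement, "worst-case failure probability: %.4f" % worst)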
Article
This paper studies a load-sharing k-out-of-n:F system whose components degrade over time. Degradation of the components causes dependent competing failures, including soft and hard failures. After each failure, the extra workload is redistributed over the surviving components, changing the distribution of degradation and the hazard rate function of the components. Furthermore, it induces dependence between the failure times of the components at consecutive stages of load sharing. The conditional distributions of the soft and hard failures as well as the reliability function of the system are investigated. The maximum likelihood estimators of the unknown parameters and the reliability function are also derived numerically. Moreover, a comprehensive example is presented to illustrate the proposed procedure. Finally, a simulation study is conducted and the parameter uncertainty problem is studied.
Article
Higher reliability and lower energy consumption are conflicting, yet among the most important, design objectives for real-time systems. Moreover, in the domain of real-time systems, non-preemptive scheduling is relatively unexplored with respect to objectives such as reliability and energy. We therefore propose an active-replication-based framework to schedule a set of periodic real-time tasks in a non-preemptive heterogeneous environment such that the given reliability and timing constraints are satisfied while the energy consumption is minimized. First, we formulate the problem as a constraint optimization problem that provides an optimal solution; however, it does not scale well. Thus, we also propose heuristics, which apply reservation of processors and reallocation of jobs, to compute suboptimal solutions efficiently in terms of energy consumption as well as schedulability. The heuristics make use of the interplay of the task-level reliability target, the reliability of replicas, the number of replicas, the reliability of tasks, and the energy consumption. We perform an experimental study on test cases generated by extending the UUnisort algorithm [1] and observe the effect of various simulation parameters on energy consumption and schedulability.
Conference Paper
We find ourselves in an age where there are over 1 billion smartphones estimated to be in use globally, and with the advent of modern mobile technologies, information is seldom more than a few clicks away. This fact leads to the question: with information from others at our fingertips, why think for ourselves? The purpose of this study is to create a framework to improve the visibility of relevant and reliable information through existing resources. The research presented in this paper investigates how information is being sourced by students and proposes a solution that improves learner critical thinking in the context of searching for academic resources over the internet and selecting the most appropriate and relevant ones.
Article
Multi-state is a characteristic of advanced manufacturing systems and complicated engineering systems. Multi-state systems (MSSs) have gained considerable popularity in the last few decades due to their reliability. In this study, the load optimization problem for MSSs is investigated from the perspective of cumulative performance. The cumulative performance of MSSs and the corresponding mission success probability (MSP) are formulated for both infinite and finite time horizons. The distribution of the cumulative performance of a system at failure or a particular time is evaluated using a set of multiple integrals. Correspondingly, two load optimization models are formulated to identify the optimal loading strategy for each state of an MSS to achieve the maximum MSP. As an example, a set of comparative studies are performed to demonstrate the advantages of the proposed method. The results show that (1) the proposed method can effectively evaluate the MSP from a cumulative performance perspective in a computationally efficient manner, and (2) the optimal loading strategy of an MSS can be determined by the proposed method, while varying with respect to the set amount of work to be completed and the maximum allowable mission time.
Article
Common bus performance sharing (CBPS) abounds in diverse applications such as transportation, power supply and collaborative computing to facilitate efficient utilization of limited system resources. This paper models such a CBPS system with repairable components performing the main system function and repairable bus lines for redistributing surplus performance of some system components to components undergoing performance deficiency. Both time-to-failure and time-to-repair of system units (including components and lines) are random and may follow any arbitrary types of distributions. A probabilistic model based on discrete-state continuous-time stochastic processes is proposed for evaluating instantaneous availability of these repairable system units. The proposed model addresses the general repair policy encompassing minimal repair, perfect repair and imperfect repair. The universal generating function technique is further implemented for evaluating the system level performance metrics, including instantaneous system availability, instantaneous expected performance deficiency, expected system availability and total expected unsupplied demand during a specified mission time. Examples are provided to illustrate the proposed evaluation methodology and its application in prioritizing maintenance improvement actions.
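The universal generating function technique mentioned above can be sketched on a toy example: each unit is represented by a polynomial u(z) = sum_i p_i z^{g_i} over its performance states, unit UGFs are combined through the structure function (performance summation for the parallel units assumed here), and availability is the probability mass at performances meeting the demand. The states, probabilities and demand below are invented for illustration.

# Toy universal generating function (UGF) sketch, illustrative values only.
from collections import defaultdict

def combine(u1, u2, op=lambda a, b: a + b):
    # u(z) represented as {performance: probability}
    out = defaultdict(float)
    for g1, p1 in u1.items():
        for g2, p2 in u2.items():
            out[op(g1, g2)] += p1 * p2
    return dict(out)

unit_a = {0: 0.1, 50: 0.9}            # failed / nominal performance
unit_b = {0: 0.2, 30: 0.3, 60: 0.5}   # a three-state unit
system = combine(unit_a, unit_b)      # parallel units: performances add
demand = 60
availability = sum(p for g, p in system.items() if g >= demand)
print("P(system performance >= %d) = %.3f" % (demand, availability))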
Book
Proceedings of the 13th Information Technology and Telecommunication Conference
Article
This monograph is an attempt to establish a framework for Operations in Financial Services as a research area from an Operations Management perspective. Operations in Financial Services has not yet developed into a well-defined research area within the Operations Management community. It has been touched upon by researchers from various disciplines, including Operations Management, Statistics, Information Technology, Finance, and Marketing. However, each discipline has a different perspective on what the important issues are, and the various disciplines are often at odds with one another. This monograph has been written from an Operations Management perspective.
Article
We consider a resource allocation model to analyze investment strategies for financial services firms in order to minimize their operational risk losses. A firm has to decide how much to invest in human resources and in infrastructure (information technology). The operational risk losses are a function of the activity level of the firm, of the amounts invested in personnel and in infrastructure, and of interaction effects between the amounts invested in personnel and infrastructure. We first consider a deterministic setting and show certain monotonicity properties of the optimal investments assuming general loss functions that are convex. We find that because of the interaction effects "economies of scale" may not hold in our setting, in contrast to a typical manufacturing environment. We then consider a general polynomial loss function in a stochastic setting with the number of transactions at the firm being a random variable. We characterize the asymptotic behaviors of the optimal investments in both heavy and light trading environments. We show that when the market is very liquid, that is, it is subject to heavy transaction volumes, it is optimal for a financial firm that is highly risk sensitive to use a balanced investment strategy. Both a heavier right tail of the distribution of transaction volume and a firm's risk sensitivity necessitate larger investments; in a heavy trading environment these two factors reinforce one another. However, in a light trading environment with the transaction volume having a heavy left tail the investment will be independent of the firm's sensitivity to risk.
Article
Dynamic voltage scaling (DVS) is a technique widely used to save energy in real-time systems. Recent research shows that it has a negative impact on system reliability. In this paper, we consider the problem of system reliability and focus on a periodic task set in which task instances share resources. First, we present a static low-power scheduling algorithm for periodic tasks with shared resources, called SLPSR, which ignores system reliability. Second, we prove that the problem of reliability-aware low-power scheduling for periodic tasks with shared resources is NP-hard and present two heuristic algorithms, called SPF and LPF. Finally, we present a dynamic low-power scheduling algorithm for periodic tasks with shared resources, called DLPSR, which reclaims dynamic slack time to save energy while preserving system reliability. Experimental results show that the presented algorithms reduce energy consumption while improving system reliability.
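The negative impact of DVS on reliability mentioned above is often captured in the literature by an exponential transient-fault-rate model in which the fault rate grows as the frequency (and voltage) is lowered, while a task of c cycles takes c/f time at normalized frequency f. The sketch below uses one such commonly cited form; the constants and the exact functional form are assumptions of this sketch rather than the paper's model.

# Sketch of a commonly cited DVS reliability model (assumed form and constants):
# the transient-fault rate grows exponentially as the normalized frequency f
# drops, while a task of c worst-case cycles takes c / f time to run.
import math

def task_reliability(c, f, lam0=1e-8, d=3.0, f_min=0.4):
    lam = lam0 * 10.0 ** (d * (1.0 - f) / (1.0 - f_min))  # fault rate at f
    return math.exp(-lam * c / f)                          # no fault during run

for f in (1.0, 0.8, 0.6, 0.4):
    print("f = %.1f  reliability = %.6f" % (f, task_reliability(c=1e5, f=f)))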
Article
Multiprocessor technology plays a major role in the design of modern computer architecture. While multiprocessor systems offer extra computing power, they also open a new range of opportunities to improve fault robustness. This paper focuses on the problem of achieving fault tolerance using replication in real-time multiprocessor systems. In this problem, multiple replicas, or copies, of a computing task are executed on distinct processors to resist potential processor failures and computing faults. Two greedy approximation heuristics, named Worst Fit Increasing K-Replication and First Fit Increasing K-Replication, are studied to maximise the number of real-time tasks assigned to a system with identical processors while respecting the tasks' replication and timing requirements. Worst-case performance is analysed using an approximation ratio between the algorithms and an optimal solution. We mathematically prove that the ratios of both algorithms are infinitely close to 2. Simulations are performed on a large set of test cases to bring to light the average performance of the algorithms in practice. The results show that both heuristic algorithms provide simple but fast and effective solutions to the problem.
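A much simplified sketch of the first-fit flavour of heuristic described above: tasks are considered in increasing utilization order and each of a task's k replicas is placed on the first processor (in index order) that still has capacity and does not already host a replica of that task; a task is admitted only if all k replicas fit. The capacities and utilizations are illustrative assumptions, and the check is a plain utilization bound rather than a full schedulability test.

# Simplified First-Fit-Increasing K-Replication sketch (illustrative values).
def ffi_k_replication(utilizations, k, n_procs, capacity=1.0):
    load = [0.0] * n_procs
    accepted = []
    for task, u in sorted(enumerate(utilizations), key=lambda t: t[1]):
        hosts = []
        for p in range(n_procs):
            if len(hosts) == k:
                break
            if load[p] + u <= capacity and p not in hosts:
                hosts.append(p)
        if len(hosts) == k:               # admit only if all k replicas fit
            for p in hosts:
                load[p] += u
            accepted.append((task, hosts))
    return accepted

print(ffi_k_replication([0.6, 0.3, 0.5, 0.2], k=2, n_procs=3))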
Chapter
The University of Illinois has been active in research in the fault-tolerant computing field for over 25 years. Fundamental ideas have been proposed and major contributions made by researchers at the University of Illinois in the areas of testing and diagnosis, concurrent error detection, and fault tolerance. This paper traces the origins of these ideas and their development within the University of Illinois, as well as their influence upon research at other institutions, and outlines current directions of research.
Chapter
An approach for assessing the impact of physical injection of transient faults on control flow behaviour is described and evaluated. The fault injection is based on two complementary methods using heavy-ion radiation and power supply disturbances. A total of 6,000 transient faults was injected into the target microprocessor, an MC6809E 8-bit CPU, running three different benchmark programs. In the evaluation, the control flow errors were distinguished from those that had no effect on the correct flow of control, viz. the control flow OKs. The errors that led to wrong results are separated from those that had no effect on the correctness of the results. The errors that had no effect on either the correct flow or the correct result are specified. Three error detection techniques, namely two software-based techniques and one watchdog timer, were combined and used in the test in order to characterize the detected and undetected errors. It was found that more than 87% of all errors and 93% of the control flow errors could be detected.
Article
The Dynamic Voltage Scaling (DVS) technology, which is widely used in numerous energy management schemes, has a negative effect on system reliability. Based on the artificial bee colony (ABC) algorithm, two novel reliability-aware scheduling algorithms are proposed for DVS systems with discrete frequencies, which meet the energy constraint and deadline while maximizing the reliability of the system. The simulation results indicate that the dynamic scheduling algorithm outperforms the static scheduling algorithm, and that its performance is close to that of an optimal scheduler that knows the exact workload in advance.
Article
A series-parallel system with common bus performance sharing is proposed in which the performance and failure rate of each element depend on the load it is carrying. In such a system, the surplus performance of a sub-system can be transmitted to other, deficient sub-systems. The transmission capacity of the common bus performance sharing mechanism is a random variable. The effects of load on element performance and failure rate are considered in this paper. A reliability evaluation algorithm based on the universal generating function technique is suggested, and numerical experiments are conducted to illustrate the algorithm.
Article
This paper considers a multi-period production system where a set of machines are arranged in parallel. The machines are unreliable, and the failure rate of each machine depends on the load assigned to it. The expected production rate of the system is considered to be a non-monotonic function of its load. Because of the machine failure rates, the total production output depends on the combination of loads assigned to the different machines. We consider the integration of load distribution decisions with production planning decisions. The product demands are assumed to be known in advance. The objective is to minimize the sum of holding costs, backorder costs, production costs, setup costs, capacity change costs and unused capacity costs while satisfying the demand over the specified time horizon. The constraint is not to exceed the available repair resources required to repair machine breakdowns. The paper develops two heuristics to solve the integrated load distribution and production planning problem. The first heuristic consists of a three-phase approach, while the second is based on the tabu search metaheuristic. The efficiency of the proposed heuristics is tested on randomly generated problem instances.
Article
In a parallel-structure load-sharing system, the failure rate of the operating components usually increases due to the additional loading induced by other components' failures, so failure dependency exists among components. To quantify the failure dependency, a dependence function is introduced. Under the assumptions that the repair time distributions of the components are arbitrary and the lifetimes are exponentially distributed with failure rates that vary with the number of operating components, a new load-sharing parallel system with failure dependency is proposed. To model the stochastic behavior of the system, the semi-Markov process induced by it is given, along with the associated semi-Markov kernel. The availability and the time to the first system failure are obtained by employing Markov renewal theory. A numerical example is presented to illustrate the results obtained in the paper, and the impact of the failure dependence on the system is also considered.
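For the smallest instance of the load-sharing parallel system above, the time to the first system failure can be read off a two-state continuous-time Markov chain: with per-component rate lambda1 while both components operate and lambda2 after the first failure, the mean time to system failure is 1/(2*lambda1) + 1/lambda2. The sketch below recovers this from the generator restricted to the transient states; the rates are illustrative assumptions.

# Mean time to first system failure of a 2-component load-sharing parallel
# system via the CTMC fundamental matrix (rates are illustrative assumptions).
import numpy as np

lam1, lam2 = 0.01, 0.03        # per-component rates: both working / one left
# transient states: 0 = both working, 1 = one working; absorbing = system down
Q = np.array([[-2 * lam1, 2 * lam1],
              [0.0,      -lam2]])
mttf = -np.linalg.solve(Q, np.ones(2))[0]   # first entry: start with both up
print("MTTF = %.1f (closed form: %.1f)" % (mttf, 1 / (2 * lam1) + 1 / lam2))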
Article
This paper considers single-component repairable systems supporting different levels of workloads and subject to random repair times. The mission is successful if the system can perform a specified amount of work within the maximum allowed mission time. The system can work with different load levels, each corresponding to different productivity, time-to-failure distribution, and per time unit operation cost. A numerical algorithm is first suggested to evaluate mission success probability and conditional expected cost of a successful mission for the considered repairable system. The load optimization problem is then formulated and solved for finding the system load level that minimizes the expected mission cost subject to providing a desired level of the mission success probability. Examples with discrete and continuous load variation are provided to illustrate the proposed methodology. Effects of repair efficiency, repair time distribution, and maximum allowed time on the mission reliability and cost are also investigated through the examples.
Article
Experimental data on transient faults from several digital computer systems are presented and analyzed. This research is significant because earlier work on validation of reliability models has concentrated only on permanent faults. The systems for which data have been collected are the DEC PDP-10 series computers, the C. vmp fault tolerant microprocessor, and the Cm* multiprocessor. Current results show that transient faults do not occur with constant failure rates as has been commonly assumed. Instead, the data for all three systems indicate Weibull distributions with decreasing failure rates.
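The decreasing failure rates reported above correspond to a Weibull shape parameter below one. The snippet below shows, on synthetic interarrival times (a mixture of exponential rates, which is known to produce a decreasing hazard), how such a shape estimate can be obtained; the data are simulated stand-ins, not the measurements from the cited study.

# Fit a Weibull to synthetic transient-fault interarrival times; a fitted
# shape < 1 indicates a decreasing failure rate (data simulated, not measured).
import numpy as np
from scipy.stats import weibull_min

rng = np.random.default_rng(0)
# heterogeneous operating conditions: mixing exponential rates yields a
# pooled sample with a decreasing hazard
rates = rng.choice([0.2, 1.0, 5.0], size=5000)
samples = rng.exponential(1.0 / rates)
shape, loc, scale = weibull_min.fit(samples, floc=0)
print("fitted Weibull shape = %.2f (shape < 1 -> decreasing failure rate)" % shape)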
Article
In this paper some measures are presented that characterize both the performance and the reliability of digital computing systems in time-sharing environments from a user viewpoint. The measures (Apparent Capacity and Expected Elapsed Time required to correctly execute a given program) are based on a mathematical model built upon traditional assumptions. The model is a hybrid in that it uses statistics gathered from a real system while giving analytical expressions for other quantities such as the Expected Elapsed Time. The main parameters of the model are the system workload and the distribution of the time between errors. Although still limited because of the restrictive assumptions used, the model gives quantitative results about how much a user can expect from a time-sharing system as a function of the system workload and reliability. For example, this study measured a four-to-one range in mean time to system failure as a function of system load. For the maximum-load period measured, the model predicts a 40% contribution from system unreliability to the expected computation time of a program that would require 30 minutes of CPU time on an unloaded system.
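As a hedged illustration of how such user-level measures couple workload and reliability, consider a classical restart model (not necessarily the exact model of the cited paper): errors arrive as a Poisson process with rate \(\lambda\), which the measurements above show to depend on load, and a program needing \(x\) seconds of CPU time must be rerun from the start after each error. Then

\[
\mathbb{E}[T(x)] \;=\; \frac{e^{\lambda x} - 1}{\lambda},
\qquad
\text{apparent capacity} \;\approx\; \frac{x}{\mathbb{E}[T(x)]} \;=\; \frac{\lambda x}{e^{\lambda x} - 1}.
\]

Under this simplified model, x = 30 CPU-minutes combined with a mean time between errors of roughly 45 minutes already inflates the expected run time by about 40%, the same order of magnitude as the figure quoted above.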
Article
In this paper a new modeling methodology is presented to characterize failure processes in time-sharing systems due to hardware transients and software errors. The basic assumption is that the instantaneous failure rate of a system resource can be approximated by a deterministic function of time plus a zero-mean stationary Gaussian process, both depending on the usage of the resource considered. The probability density function of the time to failure obtained under this assumption has a decreasing hazard function, partially explaining why other decreasing-hazard densities such as the Weibull fit experimental data so well. Furthermore, by considering the operating system kernel as a system resource, this methodology sets the basis for independent methods of evaluating the contributions of software and hardware to system unreliability. The modeling methodology has been validated by the analysis of a real system, and the predicted system behavior is compared with the predictions of other models such as the exponential, Weibull, and periodic failure rate models. The implications of this methodology are discussed and applications are given in the areas of performance/reliability modeling, software reliability evaluation, models incorporating permanent hardware faults, policy optimization, and design optimization.
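The mechanism at the heart of the methodology above, namely that randomness in the instantaneous failure rate yields a time-to-failure distribution with a decreasing hazard, can be illustrated with a deliberately simplified stand-in for the Gaussian-process model: draw a (clipped) random rate per run, sample an exponential failure time at that rate, and observe that the pooled sample's empirical hazard decreases with time. All parameters below are invented.

# Simplified stand-in for the model above (invented parameters): each run's
# failure rate is a base rate plus a random fluctuation; pooling the resulting
# exponential failure times yields a decreasing empirical hazard.
import numpy as np

rng = np.random.default_rng(1)
lam = np.clip(0.5 + 0.4 * rng.standard_normal(200_000), 0.05, None)
ttf = rng.exponential(1.0 / lam)
edges = np.linspace(0.0, 8.0, 9)
for lo, hi in zip(edges[:-1], edges[1:]):
    at_risk = np.count_nonzero(ttf >= lo)
    failed = np.count_nonzero((ttf >= lo) & (ttf < hi))
    print("hazard in [%.0f, %.0f): %.3f" % (lo, hi, failed / ((hi - lo) * at_risk)))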
Article
A great deal has been written during the past few years on the subject of diagnostic test procedures for digital systems. Almost without exception, however, investigators have limited their interest to the detection and location of solid faults, and their test procedures are usually based on the assumption that either the fault persists for the running time of the test procedure or the interval between fault occurrences is shorter than the time required to run the test.
Article
In this correspondence we present a statistical model which relates mean computer failure rates to level of system activity. Our analysis reveals a strong statistical dependency of both hardware and software component failure rates on several common measures of utilization (specifically CPU utilization, I/O initiation, paging, and job-step initiation rates). We establish that this effect is not dominated by a specific component type, but exists across the board in the two systems studied. Our data covers three years of normal operation (including significant upgrades and reconfigurations) for two large Stanford University computer complexes. The complexes, which are composed of IBM mainframe equipment of differing models and vintage, run similar operating systems and provide the same interface and capability to their users. The empirical data comes from identically structured and maintained failure logs at the two sites along with IBM OS/VS2 operating system performance/load records.
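A minimal sketch of this kind of dependency analysis, with synthetic data standing in for the failure logs and load records (the variable names and coefficients are invented): regress an observed failure-rate measure on utilization measures such as CPU utilization and I/O initiation rate and inspect the fitted coefficients.

# Synthetic stand-in for a workload/failure-rate regression (illustrative only).
import numpy as np

rng = np.random.default_rng(2)
n = 500
cpu_util = rng.uniform(0.1, 1.0, n)            # fraction of CPU busy
io_rate = rng.uniform(10, 200, n)              # I/O starts per second
# assumed ground truth: failure rate rises with both measures, plus noise
fail_rate = 0.02 + 0.3 * cpu_util + 0.001 * io_rate + 0.01 * rng.standard_normal(n)
X = np.column_stack([np.ones(n), cpu_util, io_rate])
coef, *_ = np.linalg.lstsq(X, fail_rate, rcond=None)
print("intercept, CPU-utilization and I/O-rate coefficients:", coef)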
Article
In digital circuits there is typically a delay between the occurrence of a fault and the first error in the output. This delay is the error latency of the fault. A model to characterize the error latency of a fault in a sequential circuit is presented. Random testing of sequential circuits is analyzed using the error-latency model (ELM). For a desired quality of test, the necessary length of the random test may be specified. Conversely, the quality of a test with a fixed length may be calculated. The accuracy of a previous analysis of random testing is shown to be quite poor in some cases.
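A hedged numerical companion to the error-latency idea above: if a given fault is detected by a random input vector with probability \(p\) per vector (an assumption of this sketch, matching the usual random-testing setting rather than the specifics of the ELM), the error latency in vectors is geometric with mean \(1/p\), and the random test length needed to reach detection confidence \(q\) satisfies

\[
\mathbb{E}[L] \;=\; \frac{1}{p},
\qquad
N \;\ge\; \frac{\ln(1-q)}{\ln(1-p)} .
\]

For instance, p = 0.01 and q = 0.99 give N of roughly 459 random vectors.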