General fault-tolerance requirements and features of telecommunication switching control systems are outlined and a development of fault-tolerance techniques on a multiple microprocessor system is briefly described.
was formed by the TC on Fault-Tolerant Computing of the .... large part to be of relevance to computer-based systems, ...... usage: 1) fault prevention, tolerance, and diagnosis, ..... Petri nets for quantitative evaluation), or they can be used ...
This paper presents a design optimisation tool for distributed embedded real-time systems that 1) decides the mapping and fault-tolerance policy and generates a fault-tolerant schedule, 2) is targeted at hard real-time systems, 3) has a hard reliability goal, 4) generates a static schedule for processes and messages, 5) provides fault tolerance for k transient/soft faults, 6) optimises for minimal energy consumption, while considering the impact of lowering voltages on the probability of faults, and 7) uses a constraint logic programming (CLP) based implementation.
Distributed fault-tolerance can mask the effect of a limited number of permanent faults, while self-stabilization provides forward recovery after an arbitrary number of transient faults hit the system. FTSS protocols combine the best of both worlds since they are simultaneously fault-tolerant and sel...
Critical embedded systems need a dependable operating system and application. Despite all efforts to prevent and remove faults in system development, residual software faults usually persist. Therefore, critical systems need some sort of fault tolerance to deal with these faults and also with hardwa...
We present an approach to the synthesis of fault-tolerant hard real-time systems for safety-critical applications. We use checkpointing with rollback recovery and active replication for tolerating transient faults. Processes and communications are statically scheduled. Our synthesis approach decides the assignment of fault-tolerance policies to processes, the optimal placement of checkpoints and the mapping of processes to processors such that multiple transient faults are tolerated and the timing constraints of the application are satisfied. We present several design optimization approaches which are able to find fault-tolerant implementations given a limited amount of resources. The developed algorithms are evaluated using extensive experiments, including a real-life example.
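The checkpointing-with-rollback idea mentioned in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' synthesis tool: the names `TransientFault`, `run_with_checkpoints` and `make_flaky` are hypothetical, the process state is assumed to be a plain dictionary, and a fault is assumed to be detected by an exception raised inside a segment.

```python
import copy


class TransientFault(Exception):
    """Hypothetical signal raised when an error-detection check fails."""


def run_with_checkpoints(segments, state, max_faults):
    """Run segments in order; on a detected fault, restore the last
    checkpoint and re-execute the segment (rollback recovery)."""
    faults_seen = 0
    for segment in segments:
        checkpoint = copy.deepcopy(state)  # save state before the segment
        while True:
            try:
                segment(state)
                break  # segment finished without a detected fault
            except TransientFault:
                faults_seen += 1
                if faults_seen > max_faults:
                    raise  # more faults than the assumed bound k
                state.clear()
                state.update(copy.deepcopy(checkpoint))  # roll back


    return state, faults_seen


def make_flaky(failures):
    """Build a segment that increments state['x'] and fails `failures` times."""
    remaining = [failures]

    def segment(state):
        state["x"] = state.get("x", 0) + 1
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientFault()

    return segment
```

Calling `run_with_checkpoints([make_flaky(0), make_flaky(2), make_flaky(0)], {"x": 0}, max_faults=2)` re-executes the second segment twice and still ends with the fault-free result. Each checkpoint adds overhead in the fault-free case, which is why checkpoint placement is an optimization problem in the first place.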
In this paper, we propose an approach for achieving detection and identification of faults, and provide fault-tolerant control for systems that are modeled using timed hybrid Petri nets. For this purpose, an observer-based technique is adopted which is useful in detection of faults, such as sensor faults, actuator faults, signal conditioning faults, etc. The concepts of estimation, reachability and diagnosability have been considered for analyzing faulty behaviors, and based on the detected faults, different schemes are proposed for achieving fault-tolerant control using optimization techniques. These concepts are applied to a typical three tank system and numerical results are obtained. PMID:21507399
monitor, in real-time, the signals from FTMP's system bus. The system was ... The Fault-Tolerant Multi-Processor system is a test-bed used for fault- .... Where the "_ -_" symbol means bidirectional data flow. A data rate estimate of each path link ...
This work addresses the issue of design optimization for fault-tolerant hard real-time systems. In particular, our focus is on the handling of transient faults using both checkpointing with rollback recovery and active replication. Fault-tolerant schedules are generated based on a conditional process graph representation. The formulated system synthesis approaches decide the assignment of fault-tolerance policies to processes, the optimal placement of checkpoints and the mapping of processes to processors, such that multiple transient faults are tolerated, transparency requirements are considered, and the timing constraints of the application are satisfied.
The active fault-tolerant control approach relies heavily on knowledge of the faults that have occurred. In order to improve the safety of the reconfigurable system, a methodology to incorporate actuator health in the fault-tolerant control design is proposed for a tracking problem. Indeed, information about actuator heal...
Fault-tolerant computing is the art and science of building computer systems that continue to operate normally in the presence of faults. The fault-tolerance field covers a wide spectrum of research areas ranging from computer hardware to computer software. A common approach t...
The use of software diversity (SD) to achieve fault tolerance in industrial control systems is examined, reviewing the results of recent experimental investigations and applications. Topics addressed include safety systems for rail transport, experimental safety systems for nuclear reactors, the control software for the Airbus and ATR aircraft, tolerating software design faults, and reliability modeling for fault-tolerant software. Extensive diagrams and flow charts are provided.
THE VALIDATION PROCESS FOR FAULT-TOLERANT SYSTEMS SUMMARY ..... . 3. 2.0 ...... of preventive maintenance a The study of the short-term effects of random failures l. Diagnosis of intermittent faults. 78 ...... to a human observer or to a ...
Jul 1, 2008 ... systems developed to characterize the response of a fault-tolerant ..... through more advanced scalable diagnosis strategies and robust distributed .... classifies the behavior of the SUT by applying a multi-observer-based fault ...
This report describes an experiment in the design of a general-purpose fault-tolerant system, FTM. The main objective of the FTM design was to implement a "low-cost" fault-tolerant system that could be used on standard workstations. At the operating system level, our goal was to provide a methodolog...
Runtime verification (RV) is a natural fit for ultra-critical systems, where correctness is imperative. In ultra-critical systems, even if the software is fault-free, because of the inherent unreliability of commodity hardware and the adversity of operational environments, processing units (and their hosted software) are replicated, and fault-tolerant algorithms are used to compare the outputs. We investigate both software monitoring in distributed fault-tolerant systems, as well as implementing fault-tolerance mechanisms using RV techniques. We describe the Copilot language and compiler, specifically designed for generating monitors for distributed, hard real-time systems. We also describe two case studies in which we generated Copilot monitors in avionics systems.
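The comparison of replicated outputs described above is often done with a majority voter. The following is a minimal Python sketch of that general idea, not Copilot code and not the monitors from the case studies; the function name `majority_vote` is hypothetical, and outputs are assumed to be hashable values that can be compared for exact equality.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas,
    or None when no value has a strict majority."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # A strict majority is required so that a minority of faulty
    # replicas cannot decide the result.
    if count * 2 > len(outputs):
        return value
    return None
```

With three replicas, `majority_vote([7, 7, 9])` masks the single deviating output, while `majority_vote([1, 2, 3])` reports that no majority exists, leaving the caller to decide how to react.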
This paper is concerned with the reconfigurability of fault-tolerant control systems based on the reliability analysis of components. The aim of this work is to present the need for reliability analysis in fault-tolerant control design. The admissibility of system reconfigurability with respect to the reli...
This paper is concerned with the properties of fault-tolerant control systems with respect to the reliability of components. The aim of this paper is to present the need for reliability analysis in fault-tolerant control design. The focus of our study is on the reconfigurability analysis of systems with actuat...
This paper presents results for an automatic navigation system for agricultural vehicles. The system uses stereo-vision, inertial sensors and GPS. Special emphasis has been placed on modeling the natural environment in conjunction with a fault-tolerant navigation system. The results are exemplified by an agricultural vehicle following cut grass (swath). It is demonstrated how faults in the system can be detected and diagnosed using state-of-the-art techniques from the fault-tolerance literature. Results in performing fault diagnosis and fault accommodation are presented using real data.
Sensors are important parts of the navigation system. Decentralized filtering for sensor fault tolerance has great significance for improving system reliability. A method combining federated filtering and covariance intersection filtering is adopted to achieve decentralized filtering for sensor fault tolerance, and the navigation accuracy and fault tolerance are analyzed. The simulation results indicate that implementing decentralized filtering on an integrated navigation system composed of multiple sensors brings the benefit of high accuracy and improves the fault-tolerance capability of the system due to the information redundancy. Thus, more complete and accurate navigation information about the movable sensor carrier can be provided.
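For readers unfamiliar with covariance intersection, a scalar Python sketch of the standard fusion rule is given here. It covers only the covariance intersection step, not the paper's combined federated scheme. The function name `covariance_intersection` and the fixed weight `omega` are assumptions for illustration; in practice the weight is usually chosen by minimizing a measure of the fused covariance.

```python
def covariance_intersection(x1, p1, x2, p2, omega):
    """Fuse two scalar estimates (x1, p1) and (x2, p2), where p1 and p2
    are variances, using covariance intersection with weight omega."""
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie in [0, 1]")
    if p1 <= 0.0 or p2 <= 0.0:
        raise ValueError("variances must be positive")
    # Fused information is a convex combination of the two informations.
    information = omega / p1 + (1.0 - omega) / p2
    p = 1.0 / information
    x = p * (omega * x1 / p1 + (1.0 - omega) * x2 / p2)
    return x, p
```

Setting `omega` to 1.0 or 0.0 reduces the result to one of the two inputs, which is one simple way to exclude a sensor that has been declared faulty.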
Combinatorial models, such as fault trees, digraphs and reliability block diagrams, usually use cutset-based techniques for quantitative analysis. Recent research has suggested that the Binary Decision Diagram (BDD) offers an efficient solution alternative. However, combinatorial models by themselves are not sufficient for the analysis of fault-tolerant systems unless augmented by a coverage model, which assesses the effectiveness of the recovery mechanisms incorporated for fault tolerance. In this paper, we describe the DREDD algorithm (Dependability and Risk Evaluation using Decision Diagrams), which effectively combines the BDD solution method for a combinatorial model with the solution of a coverage model. Three example fault-tolerant systems are analyzed using DREDD.
Assumptions useful for fault-tolerant quantum computing are stated and briefly discussed. We focus on assumptions related to properties of the computational system. The strongest form of the assumptions seems to be sufficient for achieving highly fault-tolerant quantum computation. We discuss weakenings which are also likely to suffice.
and Reconfiguration (FDIR) Manager for the triply redundant, tightly ... construction of a “distributed, fault and damage tolerant, real time information processing ... system is composed of several Fault Tolerant Processors that are linked ..... but the time penalty for not using highly optimized assembly language code to align ...
Algorithms have been developed allowing operation of robotic systems under damaged conditions. Specific areas addressed were optimal sensor location, adaptive nonlinear control, fault-tolerant robot design, and dynamic path-planning. A seven-degree-of-freedom hydraulic manipulator with a fault-tolerant joint design was also constructed and tested. This report completes this project, which was funded under the Laboratory Directed Research and Development program.
Every system of any significant size is created by composition from smaller sub-systems or components. It is thus fruitful to analyze the fault-tolerance of a system as a function of its composition. In this paper, two basic types of system composition are described, and an algebra to describe fault tolerance of composed systems is derived. The set of systems forms monoids under the two composition operators, and a semiring when both are concerned. A partial ordering relation between systems is used to compare their fault-tolerance behaviors.
This paper focuses on the problem of active fault-tolerant control for switched systems with time delay. By utilizing the fault diagnosis observer, an adaptive fault estimation algorithm is proposed, which can estimate the fault signal quickly and accurately. Meanwhile, a delay-dependent criterion is obtained with the purpose of reducing the conservatism of the adaptive observer design. Based on the fault estimation information, an observer-based fault-tolerant controller is designed to guarantee the stability of the closed-loop system. In terms of linear matrix inequalities, sufficient conditions are derived for the existence of the adaptive observer and fault-tolerant controller. Finally, a numerical example is included to illustrate the efficiency of the proposed approach. Copyright 2011...
The vehicle's onboard guidance system, the fault tolerant inertial navigation unit (FTINU), begins a system alignment. ... Assemble and brief post-launch inspection teams. ... Ares I-X flight termination system and solid rocket motors are armed.
This paper deals with the problem of active fault-tolerant control (FTC) for time-delay Takagi-Sugeno (T-S) fuzzy systems based on a fuzzy adaptive fault diagnosis observer (AFDO). A novel fuzzy fast adaptive fault estimation (FAFE) algorithm for T-S fuzzy models is proposed to enhance the performance of fault estimation, and sufficient conditions for the existence of the fault estimator are given in terms of linear matrix inequalities (LMIs). Using the obtained on-line fault estimation information, an observer-based fast active fault-tolerant controller is designed to compensate for the effect of faults by stabilizing the closed-loop system. Simulation results of a track trail system and a nonlinear numerical example are presented to illustrate the effectiveness of the proposed method.
The wireless sensor network is a resource-constrained self-organizing system that consists of a large number of tiny sensor nodes. Due to the low-cost and low-power nature of sensor nodes, sensor nodes are failure-prone when sensing and processing data. Most existing fault-tolerance research for wireless sensor networks has focused on crash faults or power faults and less on Byzantine faults. Hence, in this paper, we propose a power-saving data aggregation algorithm for Byzantine faults to provide power savings and high success rates even in environments with high fault rates. The algorithm utilizes the concept of Byzantine masking quorum systems to mask the erroneous values and to finally determine the correct value. Our simulation results demonstrate that when the fault rate of sensor nodes is up to 50%, our algorithm still achieves a 48% success rate in obtaining the correct value. Under the same conditions, other fault-tolerant algorithms almost always fail.
In this paper, fault-tolerant control is investigated for a class of nonlinear systems with model errors or external disturbance. First, a fault diagnosis method is proposed to estimate the faults based on the adaptive state observer. Then a novel fault-tolerant control scheme is designed to stabilize the nonlinear systems via the feedback linearization technique. The main advantage of the approach is that it is easy to implement and in most cases it can place the poles of the closed-loop systems arbitrarily. The effectiveness of the proposed scheme is demonstrated through numerical simulation.
Fault diagnosis in large-scale systems that are products of modern technology presents formidable challenges to manufacturers and users. This is due to the large number of failure sources in such systems and the need to quickly isolate and rectify failures with minimal down time. In addition, for fault-tolerant systems and systems with infrequent opportunity for maintenance (e.g., Hubble telescope, space station), the assumption of at most a single fault in the system is unrealistic. In this project, we have developed novel block and sequential diagnostic strategies to isolate multiple faults in the shortest possible time without making the unrealistic single fault assumption.
In highly automated aerospace and industrial systems where maintenance and repair cannot be carried out immediately, it is crucial to design control systems capable of ensuring desired performance when taking into account the occurrence of faults/failures on a plant/process; such a control technique is referred to as fault-tolerant control (FTC). The control system possessing such fault-tolerance capability is referred to as a fault-tolerant control system (FTCS). The objective of FTC is to maintain system stability and current performance of the system close to the desired performance in the presence of system component and/or instrument faults; in certain circumstances a reduced performance may be acceptable. Various control design methods have been developed in the literature with the t...
The purpose of this research project is to improve current onboard decision support systems. Special focus is on the onboard prediction of the instantaneous sea state. In this project a new approach to increasing the overall reliability of a monitoring and decision support system has been established. The basic idea is to convert the given system into a fault-tolerant system and to improve multi-sensor data fusion for the particular system. The background of the project is the SeaSense system, which has been installed on several container ships and navy vessels. The SeaSense system provides a crude and simple estimation of the actual sea state (Hs and Tz), information about the longitudinal hull girder loading, seakeeping performance of the ship, and decision support on how to operate the ship within acceptable limits. The system is able to identify critical forthcoming events and to give advice regarding speed and course changes to decrease the wave-induced loads. The SeaSense system is based on the combined use of a mathematical model and measurements from a set of sensors. The overall dependability of a shipboard monitoring and decision support system such as the SeaSense system can be improved using fault-tolerant techniques (Fault Diagnosis and System Re-design) and a Sensor Fusion Quality (SFQ) test. Fault diagnosis means detecting the presence of faults in the system. When sea state estimation is conducted by a ship-wave buoy analogy, the best solution is achieved when a set of three different ship responses is used. Faulty signals should be discarded from the procedure for sea state estimation if possible; if not, the fault should be estimated. The fault diagnosis can be divided into three steps: fault detection, fault isolation and fault estimation. Fault detection means deciding whether or not a fault has occurred. This step determines the time at which the system is subjected to the given fault.
Fault isolation finds the component in which a fault has occurred. This step determines the location of the fault. Fault estimation provides an estimate of the magnitude of a fault. A supervisory function determines the severity of the fault once its origin has been isolated and its magnitude estimated. Fault-tolerant Sensor Fusion means that the monitoring and decision support system can accommodate faults so that the overall system continues to satisfy its goal, while in the absence of a fault the system should be able to provide the most accurate information using the SFQ test.
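The three diagnosis steps described above (detection, isolation, estimation) can be sketched for a toy case. This is not the SeaSense method: it assumes three or more redundant sensors that measure the same quantity, uses the median as a fault-free reference, and the function name `diagnose` and the fixed `threshold` are hypothetical.

```python
from statistics import median


def diagnose(readings, threshold):
    """Return None if no fault is detected, otherwise a tuple
    (index of the suspected sensor, estimated fault magnitude)."""
    reference = median(readings)
    # Residual: deviation of each sensor from the reference value.
    residuals = [r - reference for r in readings]
    # Isolation candidate: the sensor with the largest absolute residual.
    worst = max(range(len(readings)), key=lambda i: abs(residuals[i]))
    # Detection: a fault is declared only above the threshold.
    if abs(residuals[worst]) <= threshold:
        return None
    # Estimation: the residual approximates the fault magnitude.
    return worst, residuals[worst]
```

For readings `[2.0, 2.1, 5.0]` and a threshold of `0.5`, the third sensor is isolated with an estimated offset of roughly 2.9. The sketch handles at most one faulty sensor at a time; a supervisory layer would then decide whether to discard that signal or to correct it.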
New fault diagnosis (FD) and fault-tolerant control (FTC) algorithms for non-Gaussian singular stochastic distribution control (SDC) systems are presented in this paper. Different from general SDC systems, in singular SDC systems, the relationship between the weights and the control input is expressed by a singular state space model, which increases the difficulty in the FD and FTC design. The proposed approach relies on an iterative learning observer (ILO) for fault estimation. The fault may be constant, fast-varying or slow-varying. Based on the estimated fault information, the fault-tolerant controller can be designed to make the post-fault probability density function (PDF) still track the given distribution. Simulations are given to show the effectiveness of the proposed FD and FTC al...
In order to design a fault-tolerant digital system, it is necessary to understand its behavior in the presence of faults at an early design stage. Fault simulation is a process that purposely injects faults into a software circuit model and observes its faulty behavior. However, the runtime of such a simulation tends to grow exponentially as the circuit under test becomes large. FPGA-based fault emulation, which is fault simulation implemented in an FPGA, is an efficient way to accelerate the fault simulation process. This paper introduces a novel FPGA-based switch-level fault emulation system utilizing module-based dynamic partial reconfiguration. In this approach, faults are modeled at switch-level and mapped to gate-level description for efficient FPGA implementation. The circuit under test is partitioned using u...
In this paper a procedure for modelling satellite formations including failure dynamics as a piecewise-affine hybrid system is shown. The formulation enables recently developed methods and tools for control and analysis of piecewise-affine systems to be applied, leading to synthesis of fault-tolerant controllers and analysis of the system behaviour given possible faults. The method is illustrated using a simple example involving two satellites trying to reach a specific formation despite actuator faults occurring.
Free-piston Stirling power conversion has been considered a candidate for ... architecture, systems were designed with kinematic Stirling engines with rotary .... and GRC, with various degrees of detail and different concepts for fault tolerance ...
Several techniques such as neural networks, fuzzy logic, expert systems, etc have been ... recognition and classification, financial prediction, and signal processing. Neural .... a fault-tolerant engine control scheme can be applied. The first step ...
Generation Tool (APGEN) will be modified to a ... high-speed scalable fault-tolerant distributed avionics ... distributed system functionally executing VxWorks, ..... MATLAB. The TFP tool is currently used by missions like Deep Space 1, Cassini, ...
Avionics and control systems for aircraft use distributed, fault-tolerant computer sys- ...... symmetric, meaning that whatever the effect, it is the same for all observers ... ment (e.g., group membership, consensus, diagnosis) cannot be solved in ...
A property observed in high reliability fault-tolerant control systems is the relatively rare occurrence of component failures compared to the frequent occurrence of redundancy management decision events. This property leads to a temporal decomposition of...
5.4.1 UNIVERSITY RESEARCH IN DEPENDABLE ...... software-based systems, and for using technology interventions to achieve a predictable .... as fault tolerance, transparent computing, legacy code translation, and universal programming ...
autonomous self-healing eDNA architecture is expensive in terms of execution time. .... The package transfer in the network is implemented with a fault-tolerant ..... 2009 NASA/ESA. Conference on Adaptive Hardware Systems (2009) 155– ...
methodology. It also contains a blank Architectural Assessment Form in the back of the document— ..... completely faulttolerant to support both mission success & total safety) ..... For the system being assessed, all basic principles have been ...
Consensus is at the heart of fault-tolerant distributed computing systems. Much research has been devoted to developing algorithms for this particular problem. This paper presents a semi-automatic verification approach for asynchronous consensus algorithms, aiming at facilitating their development. ...
Papers are presented at a conference on array processors and parallelism. Topics include the following: mapping algorithms to architectures; image processing; fault tolerance in systolic arrays; performance; testing; and multiple-bus multiprocessor systems.
The goal of this project was to understand what key interchangeable elements form a scalable, fault-tolerant quantum information systems architecture. The effort was a collaboration between computer science and physical sciences involving four groups: MIT...
In networked embedded systems, runtime adaptive software promises an increase in flexibility, fault tolerance and extensibility. Often, this requires that software components have to be allocated dynamically to execution platforms at runtime. Hence, the platforms have to execute dynamically changing...
A new set-up for fault-tolerant control (FTC) for stable systems is presented in this paper. The new set-up is based on a simple implementation of the Youla-Jabr-Bongiorno-Kucera (YJBK) parameterization. This implementation of the YJBK parameterization will allow a direct and simple reconfiguration of the feedback controller. Another central part of fault-tolerant control is fault diagnosis. The controller implementation can be applied directly in connection with both passive fault diagnosis (PFD) and active fault diagnosis (AFD). The presented FTC set-up is investigated with respect to sensor reconfiguration. Actuator reconfiguration can be dealt with in a similar way.
In this paper, an active Fault-Tolerant Control (FTC) strategy is developed for Switched Hybrid Systems. The main contribution concerns the design of a linear Output Feedback dedicated to Switched Hybrid System. Based on an available Fault Detection, Isolation (FDI) scheme, the controllers redesign ...
This paper presents an effective scheme for detecting incipient faults in post-fault systems (PFSs) subject to adaptive fault-tolerant control (AFTC). Through a survey of existing techniques, it is shown that the adaptivity of the AFTC counteracts the effect of an incipient fault in the PFS. This makes some of the conventional fault-detection strategies, such as Beard-Jones detection filters and adaptive observers, ineffective in this situation. It is shown that the unknown input observer (UIO) is an effective tool; hence, the UIO is designed to decouple the incipient fault from the AFTC such that the fault-detection residual is sensitive only to the incipient fault. Extensive simulation study is presented using an aircraft example to test three fault-detection approaches; it is demonstrat...
This paper proposes a reflective object-oriented architecture for developing fault-tolerant software. Reflective object-oriented programming promotes a modular structuring of systems by means of a new dimension of modularization: the separation between base-level objects and meta-level objects. This property allows the creation of metaobjects responsible for managing tasks of application objects located at the base level. In the context of this work, computational reflection is applied to implement various strategies of fault tolerance at the meta-level in a transparent manner for the application programmer, that is, without interfering with the original structure of application objects that require fault-tolerance facilities. The use of the proposed architecture has the following advantages: (i) separation of concerns, that is, it separates the concerns related to the application domain from those related to the implementation of fault-tolerant mechanisms; (ii) it promotes code reuse of fault-tolerance mechanisms; (iii) it allows application programmers to use the most adequate fault-tolerance strategy for their implementation; and (iv) it provides a design that is more adaptable, flexible and easier to extend than traditional designs for developing fault-tolerant software. Our reflective architecture is composed of three levels, and is based on the abstraction of object groups.
Faults in steering, navigation instruments or propulsion machinery are serious on a marine vessel since the consequence could be loss of maneuvering ability, with a risk of damage to vessel, personnel or environment. Early diagnosis and accommodation of faults could enhance safety. Fault-tolerant control is a methodology that helps prevent faults from developing into failures. The means include on-line fault diagnosis, automatic condition assessment and calculation of remedial action to avoid hazards. This paper gives an overview of methods to obtain fault-tolerance: fault diagnosis; analysis of properties of a faulty system; means to determine remedial actions. The paper illustrates the techniques by two marine examples, sensor fusion for automatic steering and control of the main engine.
Instead of a quantum computer where the fundamental units are 2-dimensional qubits, the author can consider a quantum computer made up of d-dimensional systems. There is a straightforward generalization of the class of stabilizer codes to d-dimensional systems, and he will discuss the theory of fault-tolerant computation using such codes. He proves that universal fault-tolerant computation is possible with any higher-dimensional stabilizer code for prime d.
The proliferation of increasingly powerful and complex multiprocessor systems has made fault-tolerant design a necessity. Optimizing fault tolerance in multiprocessor systems is a very difficult task because it involves multi-dimensional tradeoffs. The system architecture, the computation structure, the implementation technology, the frequency, duration, and location of faults, and many other factors all have some impact on the effectiveness of a particular fault recovery procedure. The author has attempted to solve this difficult problem by a graph theoretic approach. In this dissertation, he introduces this approach and concentrates on the analysis and optimization of fault tolerance in multiprocessor systems. Specifically, a reconfiguration model that allows a faulted job to be recovered with minimum space and time overhead and without performance degradation is formally introduced. Additionally, eleven parameters are precisely defined to facilitate the evaluation of the fault tolerance of different multiprocessor systems for executing a given set of target applications. They also allow the quantitative comparison of various fault recovery techniques so that efficient algorithms can be developed. The graph theoretic approach presented is widely applicable to multiprocessor systems and applications with various topologies. In this dissertation, the author concentrates on two well-known systems, namely, the mesh and hypercube, and two frequently used computation structures, namely, the path and the complete binary tree. Solutions and algorithms for determining various optimization parameters are presented.
Application layer techniques (ALTs) have been suggested as add-on techniques that will improve the overall fault tolerance of a system on top of the fault tolerance provided by the hardware and operating-system-level techniques. Compared to the techniques in the other two layers, ALTs have the advantage that they are flexible and less expensive. In this paper we discuss three varieties of ALTs, namely control-flow checking using assertions (CCA), algorithm-based fault tolerance (ABFT), and multi-version objects (MVOs). The three approaches are relatively orthogonal in the sense that application of any combination of the techniques improves the fault tolerance of the system in a complementary fashion. Illustrative examples are provided for each technique.
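As an illustration of algorithm-based fault tolerance, the classic checksum scheme for matrix multiplication can be sketched in Python. This is a generic textbook-style example, not one taken from the paper above. The helper names are hypothetical, matrices are assumed to be lists of integer rows so that checksums compare exactly, and only a single corrupted element of the product is located.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def with_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]


def locate_error(c_full):
    """Check a full-checksum product. Return None if it is consistent,
    or (row, column) of a single corrupted data element."""
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n)
                if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("error pattern not locatable by this sketch")
```

Multiplying `with_column_checksum(a)` by `with_row_checksum(b)` yields the product with both checksums attached, so a single wrong entry shows up as exactly one inconsistent row and one inconsistent column. With floating-point data the equality checks would need a tolerance.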
The prime objective of Fault-tolerant Control (FTC) systems is to handle faults and discrepancies using appropriate accommodation policies. The issue of obtaining information about various parameters and signals, which have to be monitored for fault detection purposes, becomes a rigorous task with the growing number of subsystems. The structural approach, presented in this paper, constitutes a general framework for providing information when the system becomes complex. The methodology of this approach is illustrated on the ship propulsion benchmark.
The effect of sensor faults in direct torque control(DTC) based induction motor drives is analyzed and a new Instrument fault detection isolation scheme(IFDIS) is proposed. The proposed IFDIS, which operates in real-time, detects and isolates the incipient fault(s) of speed sensor and current sensors that provide the feedback information. The scheme consists of an adaptive gain scheduling observer as a residual generator and a special sequential test logic unit. The observer provides not only the estimate of stator flux, a key variable in DTC system, but also the estimates of stator current and rotor speed that are useful for fault detection. With the test logic, the IFDIS has the functionality of fault isolation that only multiple estimator based IFDIS schemes can have. Simulation results for various type of sensor faults show the detection and isolation performance of the IFDIS and the applicability of this scheme to faulttolerant control system design. (author). 21 refs., 10 figs., 5 tabs.
This paper introduces a methodology for modeling and analyzing fault-tolerant manufacturing systems that not only optimizes normal productive processes, but also performs detection and treatment of faults. This approach is based on the hierarchical and modular integration of Petri nets. The modularity provides the integration of three types of processes: those representing the productive process, fault detection, and fault treatment. The hierarchical aspect of the approach permits us to consider processes on different levels of detail (i.e. factory, manufacturing cell, or machine). Case studies considering detection and treatment of faults are presented, and a simulation tool is applied for verifying the models.
In this paper, the problem of active fault-tolerant control for a reusable launch vehicle (RLV) with actuator faults using both adaptive and sliding mode techniques is investigated. Firstly, the kinematic and dynamic equations of the RLV are given, which represent the characteristics of the RLV in the reentry flight phase. For the dynamic model of the RLV in the faulty case, a fault detection scheme is proposed by designing a nonlinear fault detection observer. Then, an active fault-tolerant tracking strategy for RLV attitude control systems is presented by making use of both adaptive control and sliding mode control techniques, which can guarantee asymptotic output tracking of the closed-loop attitude control systems in spite of actuator faults. Finally, simulation results are given to demonstrat...
Self-stabilization is a versatile approach to fault-tolerance since it permits a distributed system to recover from any transient fault that arbitrarily corrupts the contents of all memories in the system. Byzantine tolerance is an attractive feature of distributed systems that makes it possible to cope with arbitrary malicious behaviors. We consider the well-known problem of constructing a maximum metric tree in this context. Combining these two properties leads to some impossibility results. In this paper, we provide two necessary conditions for constructing a maximum metric tree in the presence of transient and (permanent) Byzantine faults.
Service-Oriented Architecture (SOA) is widely adopted for building mission-critical systems, ranging from on-line stores to complex airline management systems. Building reliable SOA systems is a major challenge due to the compositional nature of Web services. This paper proposes an adaptive QoS-aware fault tolerance strategy for Web services. Based on a user-collaborated QoS-aware middleware, SOA systems can dynamically adjust their fault tolerance configurations to achieve optimal service reliability as well as good overall performance. Both the subjective user requirements and the objective system performance of the Web services are considered in our adaptive fault tolerance strategy. Experiments are conducted to illustrate the advantages of the proposed adaptive fault tol...
This paper proposes a fault-tolerant control (FTC) scheme for polytopic linear parameter varying (LPV) systems. First, a fault compensator is proposed. Its structure is simple, but it is effective against actuator faults. Then, in order to show its basic idea, the authors consider a state feedback FTC scheme based on the fault compensator. Next, the FTC algorithm is extended to an output feedback FTC scheme. The FTC scheme for LPV systems involves a fault compensator based on the estimated fault and an observer based on linear matrix inequality (LMI). The proposed FTC method is applicable to a variety of systems and guarantees bounded states of the system in the event of actuator faults. Numerical examples are given to demonstrate its effectiveness.
This program demonstrated the integration of a number of technologies that can increase the availability and reliability of launch vehicles while lowering costs. Availability is increased with an advanced guidance algorithm that adapts trajectories in real-time. Reliability is increased with fault-tolerant computers and communication protocols. Costs are reduced by automatically generating code and documentation. This program was realized through the cooperative efforts of academia, industry, and government. The NASA-LaRC coordinated the effort, while Draper performed the integration. Georgia Institute of Technology supplied a weak Hamiltonian finite element method for optimal control problems. Martin Marietta used MATLAB to apply this method to a launch vehicle (FENOC). Draper supplied the fault-tolerant computing and software automation technology. The fault-tolerant technology includes sequential and parallel fault-tolerant processors (FTP & FTPP) and authentication protocols (AP) for communication. Fault-tolerant technology was incrementally incorporated. Development culminated with a heterogeneous network of workstations and fault-tolerant computers using AP. Draper's software automation system, ASTER, was used to specify a static guidance system based on FENOC, navigation, flight control (GN&C), models, and the interface to a user interface for mission control. ASTER generated Ada code for GN&C and C code for models. An algebraic transform engine (ATE) was developed to automatically translate MATLAB scripts into ASTER.
Distributed fault-tolerance can mask the effect of a limited number of permanent faults, while self-stabilization provides forward recovery after an arbitrary number of transient faults hit the system. FTSS protocols combine the best of both worlds since they are simultaneously fault-tolerant and self-stabilizing. To date, FTSS solutions either consider static (i.e. fixed point) tasks, or assume synchronous scheduling of the system components. In this paper, we present the first study of dynamic tasks in asynchronous systems, considering the unison problem as a benchmark. Unison can be seen as a local clock synchronisation problem, as neighbors must maintain digital clocks at most one time unit away from each other, and increment their own clock value infinitely often. We present many impossibility results for this difficult problem and propose an FTSS solution, when the problem is solvable, that exhibits optimal fault containment.
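To make the unison specification concrete, here is a minimal sketch of its safety condition together with a basic fault-free increment rule. This is not the FTSS protocol of the paper: it assumes unbounded integer clocks, an undirected edge list, and no faults, and the names `is_unison` and `step` are hypothetical.

```python
def is_unison(clocks, edges):
    """Safety condition: neighboring clocks differ by at most one time unit."""
    return all(abs(clocks[u] - clocks[v]) <= 1 for u, v in edges)


def step(clocks, edges, node):
    """Let `node` increment its clock only if no neighbor is behind it.

    Returns True if the clock was incremented. Starting from a state that
    satisfies is_unison, this rule keeps the safety condition true.
    """
    neighbors = [v for u, v in edges if u == node] + [u for u, v in edges if v == node]
    if all(clocks[n] >= clocks[node] for n in neighbors):
        clocks[node] += 1
        return True
    return False


if __name__ == "__main__":
    ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
    clocks = [0, 0, 0, 0]
    for _ in range(5):              # five round-robin rounds
        for node in range(4):
            step(clocks, ring, node)
            assert is_unison(clocks, ring)
    print(clocks)
```

A node that is ahead of a neighbor simply waits, which is why every clock can still be incremented infinitely often under a fair schedule in this fault-free setting.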
Hydraulic actuators are complex fluid power devices whose performance can be degraded in the presence of system faults. In this thesis a linear, fixed-gain, fault-tolerant controller is designed that can maintain the positioning performance of an electrohydraulic actuator operating under load with a leaking piston seal and in the presence of parametric uncertainties. Developing a control system tolerant to this class of internal leakage fault is important since a leaking piston seal can be difficult to detect unless the actuator is disassembled. The designed fault-tolerant control law is of low order, uses only the actuator position as feedback, and can: (i) accommodate nonlinearities in the hydraulic functions, (ii) maintain robustness against typical uncertainties in the hydraulic system parameters, and (iii) keep the positioning performance of the actuator within prescribed tolerances despite an internal leakage fault that can bypass up to 40% of the rated servovalve flow across the actuator piston. Experimental tests verify the functionality of the fault-tolerant control under normal and faulty operating conditions. The fault-tolerant controller is synthesized based on linear time-invariant equivalent (LTIE) models of the hydraulic actuator using the quantitative feedback theory (QFT) design technique. A numerical approach for identifying LTIE frequency response functions of hydraulic actuators from acceptable input-output responses is developed so that linearizing the hydraulic functions can be avoided. The proposed approach can properly identify the features of the hydraulic actuator frequency response that are important for control system design and requires no prior knowledge about the asymptotic behavior or structure of the LTIE transfer functions.
A distributed hardware-in-the-loop (HIL) simulation architecture is constructed that enables the performance of the proposed fault-tolerant control law to be further substantiated under realistic operating conditions. Using the HIL framework, the fault-tolerant hydraulic actuator is operated as a flight control actuator against the real-time numerical simulation of a high-performance jet aircraft. A robust electrohydraulic loading system is also designed using QFT so that the in-flight aerodynamic load can be experimentally replicated. The results of the HIL experiments show that using the fault-tolerant controller to compensate for the internal leakage fault at the actuator level can benefit the flight performance of the airplane.
Networked Control Systems (NCS) are complex systems which integrate information provided by several domains such as automatic control, computer science, and communication networks. The work presented in this paper concerns fault detection, isolation and compensation in communication networks. The proposed method is based on the classical approach of Fault Detection and Isolation and Fault-Tolerant Control (FDI/FTC) currently used in diagnosis. The modelling of the network to be supervised is based on both coloured Petri nets and network calculus theory, which are often used to represent and analyse network behaviour. The goal is to implement, inside network devices, algorithms that detect, isolate and compensate communication faults in an autonomous way.
Fault-tolerant control (FTC) for space-borne equipment is very important in engineering design. This paper presents a two-layer intelligent FTC approach to handle the speed stability problem in a swing-arm system suffering from various faults in space. This approach provides reliable FTC at the performance level, and improves the control flow error detection capability at the code level. The faults degrading the system performance are detected by the performance-based fault detection mechanism. The detected faults are categorized as anticipated faults and unanticipated faults by the fault bank. A neural network is used as an on-line estimator to approximate the unanticipated faults. Compensation control and intelligent integral sliding mode control are employed to accommodate the two types of faults at the performance level, respectively. To guarantee the reliability of the FTC at the code level, the key parts of the program code are modified by control flow checking by software signatures (CFCSS) to detect the control flow errors caused by single event upsets. Meanwhile, some of the undetected control flow errors can be detected by the FTC at the performance level. The FTC for the anticipated and unanticipated faults is verified in Synopsys Saber, and the detection of control flow errors is tested in the DSP controller. Simulation results demonstrate the efficiency of the novel FTC approach.
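The core mechanism of CFCSS can be sketched in a few lines. The example below is an illustrative simplification, not the instrumentation used in the paper: it assumes every basic block has exactly one predecessor (so the adjusting signature needed for branch fan-in nodes is omitted), and the block names, signature values, and the `run` helper are all hypothetical.

```python
# Hypothetical basic blocks with unique compile-time signatures.
SIGNATURES = {"entry": 0b0001, "compute": 0b0010, "output": 0b0100}
# Assumption: each block has a single legal predecessor.
PREDECESSOR = {"compute": "entry", "output": "compute"}
# Signature difference d_i = s_pred XOR s_i, precomputed per block.
DIFF = {blk: SIGNATURES[pred] ^ SIGNATURES[blk] for blk, pred in PREDECESSOR.items()}


class ControlFlowError(Exception):
    """Raised when the run-time signature does not match the block signature."""


def run(path):
    """Walk a sequence of blocks, updating and checking the run-time signature G."""
    g = SIGNATURES[path[0]]          # G is initialised at the entry block
    for blk in path[1:]:
        g ^= DIFF[blk]               # update: G = G XOR d_i
        if g != SIGNATURES[blk]:     # check: G must equal s_i on a legal transfer
            raise ControlFlowError(f"illegal jump into block {blk!r}")
    return g


if __name__ == "__main__":
    run(["entry", "compute", "output"])       # legal path passes silently
    try:
        run(["entry", "output"])              # skipped block, e.g. after an upset
    except ControlFlowError as exc:
        print("detected:", exc)
```

Because the update uses only the precomputed difference, a jump from the wrong predecessor leaves G with a value that does not match the target block's signature.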
A parallel algorithm for constructing k-valued fault-tolerant diagnostic tests is described. This algorithm combines two algorithms, viz. a parallel algorithm for constructing an irredundant implication matrix designed to distinguish objects from different patterns and a parallel algorithm for constructing irredundant h-fold column coverings. The IMSLOG intelligent instrumental software (IIS), on the basis of which we construct intelligent systems for various disciplines, is described. A sufficient condition for constructing diagnostic tests tolerant to a given number of measurement (entry) errors in the values of characteristic features of the object under investigation is applied to ensure fault-tolerance. Suggestions for further research are given.
This paper addresses the design and comparison of active and passive fault-tolerant linear parameter-varying (LPV) controllers for wind turbines. The considered wind turbine plant model is characterized by parameter variations along the nominal operating trajectory and includes a model of an incipient fault in the pitch system. We propose the design of an active fault-tolerant controller (AFTC) based on an existing LPV controller design method and extend this method to the design of a passive fault-tolerant controller (PFTC). Both controllers are based on output feedback and are scheduled on the varying parameter to manage the parameter-varying nature of the model. The PFTC only relies on measured system variables and an estimated wind speed, while the AFTC also relies on information from a fault diagnosis system. Consequently, the optimization problem involved in designing the PFTC is more difficult to solve, as it involves solving bilinear matrix inequalities (BMIs) instead of linear matrix inequalities (LMIs). Simulation results show the performance of the active fault-tolerant control system to be slightly superior to that of the passive fault-tolerant control system.
It will help test the shape, guidance and other systems for the X-37. ... SIGI Space Integrated Global Positioning System Inertial Navigation System; test of control ... string system is not as reliable as the fault-tolerant system planned for the X-37.
The engineering breadboard implementation for the CDRL no. D001 modular digital computer system developed during design of the logic system was documented. This effort followed the architecture study completed and documented previously, and was intended to verify the concepts of a fault-tolerant, automatically reconfigurable, modular version of the computer system conceived during the architecture study. The system has a microprogrammed 32-bit word length, a general register architecture and an instruction set consisting of a subset of the IBM System 360 instruction set plus additional fault tolerance firmware. The following areas were covered: breadboard packaging, central control element, central processing element, memory, input/output processor, and maintenance/status panel and electronics.
The motorized antenna is a key element in overseas satellite telecommunication. The control system directs the on-board antenna toward a chosen satellite while high sea waves disturb the antenna. Certain faults (communication system malfunction or signal blocking) cause interruption of the communication connection, resulting in loss of the tracking functionality and instability of the antenna. In this brief, a fault-tolerant control (FTC) system is proposed for the satellite tracking antenna. The FTC system maintains the tracking functionality by employing a proper control strategy. A robust fault diagnosis system is designed to supervise the FTC system. The employed fault diagnosis solution is able to estimate the faults for a class of nonlinear systems acting under external disturbances. Effectiveness of the method is verified through implementation and test on an antenna system.
Fault-tolerant control of dynamic processes is investigated in this paper using an auto-tuning PID controller. A fault-tolerant control scheme is proposed comprising an auto-tuning PID controller based on an adaptive neural network model. The model is trained online using the extended Kalman filter (EKF) algorithm to learn the post-fault dynamics of the system. Based on this model, the PID controller adjusts its parameters to compensate for the effects of the faults, so that the control performance is recovered from degradation. The auto-tuning algorithm for the PID controller is derived with the Lyapunov method and therefore the model-predicted tracking error is guaranteed to converge asymptotically. The method is applied to a simulated two-input two-output continuous stirred tank reactor (CSTR) with various faults, which demonstrates the applicability of the developed scheme to industrial processes. PMID:15719931
This paper addresses fault-tolerant control for position mooring of a shuttle tanker operating in the North Sea. A complete framework for fault diagnosis is presented, but the loss of a sub-sea mooring line buoyancy element is given particular attention, since this fault could lead to line breakage and risky abortion of an oil-loading operation. With significant drift forces from waves, non-Gaussian elements dominate in the residuals, and fault diagnosis needs to be designed using dedicated change detection for the type of distribution encountered. In addition to dedicated diagnosis, an optimal position algorithm is proposed to accommodate buoyancy element failure and keep the mooring system in a safe state. Detection properties and fault-tolerant control are demonstrated by high-fidelity simulations.
In many proposed architectures for quantum computing the physics of the system prevents qubits from being individually controlled. In such systems universal computation may be possible via bulk manipulation of the system. Here, we describe a method to execute globally controlled quantum information processing which admits a fault-tolerant quantum error correction scheme. Our scheme nominally uses three species of addressable two-level systems which are arranged in a one-dimensional array in a specific periodic arrangement. We show that the scheme possesses a fault-tolerant error threshold.
This paper studies the problem of fault accommodation of time-varying delay systems using an adaptive fault diagnosis observer. Based on the proposed fast adaptive fault estimation (FAFE) algorithm using only a measured output, a delay-dependent criterion is first established to reduce the conservatism of the design procedure, and the FAFE algorithm can enhance the performance of fault estimation. On the basis of fault estimation, the observer-based fault-tolerant tracking control is then designed to guarantee tracking performance of the closed-loop systems. Furthermore, comprehensive analysis is presented to discuss the calculation steps using the linear matrix inequality technique. Finally, simulation results of a stirred tank reactor model are presented to illustrate the efficiency of the propo...
In this paper, a supervisor system able to diagnose different types of faults during the operation of a proton exchange membrane fuel cell is introduced. The diagnosis is developed by applying Bayesian networks, which qualify and quantify the cause-effect relationship among the variables of the process. The fault diagnosis is based on the on-line monitoring of variables that are easy to measure in the machine, such as voltage, electric current, and temperature. The equipment is a fuel cell system which can operate even when a fault occurs. The fault effects are based on experiments on the fault-tolerant fuel cell, which are reproduced in a fuel cell model. A database of fault records is constructed from the fuel cell model, improving the generation time and avoiding permanent damage to the equipmen...
Manufacturing defects in the deep sub-micron VLSI process and aging-related problems of devices during their lifecycle are inevitable, and fault-tolerant routing algorithms are important to provide the required communication for NoCs in spite of failures. The proposed algorithm, referred to as scalable and reconfigurable fault-tolerant distributed routing (RFDR), partitions the system into nine regions using the concept of divide-and-conquer. It is a distributed algorithm, and each router guarantees fault-tolerance within its own region, so the system can still be sustained with multiple fault areas. The proposed RFDR has excellent scalability, with hardware cost remaining constant independent of system size. It is also completely reconfigurable when new nodes fail. Simulations under various synthetic traffic patterns show better performance compared to the Extended-XY routing algorithm. Moreover, there is almost no hardware overhead compared to Logic-Based Distributed Routing (LBDR), but the fault-tolerance capacity is enhanced in the proposed algorithm. Hardware cost is reduced by 37% compared to Reconfigurable Distributed Scalable Predictable Interconnect Network (R-DSPIN), which only supports a single fault region.
An active fault diagnosis (AFD) method is considered in this paper in connection with a Fault-Tolerant Control (FTC) architecture based on the YJBK parameterization of all stabilizing controllers. The architecture consists of a fault diagnosis (FD) part and a controller reconfiguration (CR) part. The FTC architecture can be applied for additive faults, parametric faults, and for system structural changes. Only parametric faults are considered in this paper. The main focus is on the use of the new approach of active fault diagnosis in connection with FTC. The active fault diagnosis approach is based on including an auxiliary input in the system. A fault signature matrix is introduced in connection with AFD, given as the transfer function from the auxiliary input to the residual output. This can be considered as a generalization of the passive fault diagnosis case, where the diagnosis is only based on a residual vector. The fault diagnosis is then derived by on-line tests using the residual vector.
Recent experimental advances have demonstrated technologies capable of supporting scalable quantum computation. A critical next step is how to put those technologies together into a scalable, fault-tolerant system that is also feasible. We propose a Quantum Logic Array (QLA) microarchitecture that forms the foundation of such a system. The QLA focuses on the communication resources necessary to efficiently support fault-tolerant computations. We leverage the extensive groundwork in quantum error correction theory and provide analysis that shows that our system is both asymptotically and empirically fault tolerant. Specifically, we use the QLA to implement a hierarchical, array-based design and a logarithmic expense quantum-teleportation communication protocol. Our goal is to overcome the primary scalability challenges of reliability, communication, and quantum resource distribution that plague current proposals for large-scale quantum computing.
Various papers on computers in aerospace studies are presented. The general topics addressed include: real-time hardware/software issues, GaAs and RISC processor architectures, system software development, knowledge-based systems in aerospace applications, verification and validation of expert systems, spaceborne processor architecture, autonomous systems, configuration management, diagnostics and fault monitoring, signal processors, principles of software reuse, AI initiatives in the Air Force, architecture for telescience, intelligent tutoring systems, satellite architecture, Ada, software reuse tools, and intelligent maintenance aids. Also considered are: modeling and simulation environments, project and software management, advanced fault-tolerant computer architecture, spacecraft command and control, fault tolerance for software-intensive systems, system acquisition management, neural nets, crew-systems integration, model-based approaches to diagnostics, parallel processing applications, software requirements engineering, planning and scheduling, software safety, computer security, and real-time embedded AI systems.
This book presents papers on supercomputers as given at a conference on computers and communications. Topics considered at the conference included dataflow programming languages, compilers, distributed systems, natural language, Prolog, logic programming, computer vision and computer-aided processes, expert systems, computer system design and architecture, parallel processing, data base management, logic circuits, and fault-tolerant computers.
This document is furnished as part of the effort to develop NRC Class 1E Digital Computer Systems Guidelines, which is Task 8 of USAF Rome Laboratories Contract F30602-89-D-0100. The report addresses four major topics, namely computer programming languages, software design and development, software testing, and fault tolerance and fault avoidance. The topics are intended as stepping stones leading to a Draft Regulatory Guide document. As part of this task a small-scale survey of software fault avoidance and fault tolerance practices was conducted among vendors of nuclear safety related systems and among agencies that develop software for other applications demanding very high reliability. The findings of the present report are based in part on the survey and in part on a review of software literature relating to nuclear and other critical installations, as well as on the authors' experience in these areas.
This paper describes a new approach and a system, SCREEN, for fault-tolerant speech parsing. SCREEN stands for Symbolic Connectionist Robust EnterprisE for Natural language. Speech parsing describes the syntactic and semantic analysis of spontaneous spoken language. The general approach is based on incremental immediate flat analysis, learning of syntactic and semantic speech parsing, parallel integration of current hypotheses, and the consideration of various forms of speech-related errors. The goal of this approach is to explore the parallel interactions between various knowledge sources for learning incremental fault-tolerant speech parsing. This approach is examined in the SCREEN system using various hybrid connectionist techniques. Hybrid connectionist techniques are examined because of their promising properties of inherent fault tolerance, learning, gradedness and parallel constraint integration. The input for SCREEN is hypotheses about recognized words of a spoken utterance potentially analyzed by a spe...
Systolic arrays are a popular model for the implementation of highly parallel VLSI systems. In this paper interstitial fault tolerance (IFT), a technique for incorporating fault tolerance into systolic arrays in a natural manner, is discussed. IFT can be used for reliable computation or for yield enhancement. Previous fault tolerance techniques for reliable computation on SIMD systems have employed redundant hardware. IFT, on the other hand, employs time redundancy. Previous wafer scale integration techniques for yield enhancement have been proposed only for linear processing element arrays. IFT is effective for both linear and two-dimensional arrays. The time redundancy to achieve IFT is shown to be bounded by a factor of 3, allowing no processor redundancy. Results of Monte Carlo simulation of IFT are presented. 19 references.
Early Orion GN&C system designs optimized for robustness, simplicity, and utilization of commercially available components. During the System Definition Review (SDR), all subsystems on Orion were asked to re-optimize with component mass and steady-state power as primary design metrics. The objective was to create a mass reserve in the Orion point-of-departure vehicle design prior to beginning the PDR analysis cycle. The Orion GN&C subsystem team transitioned from a philosophy of absolute two-fault tolerance for crew safety and one-fault tolerance for mission success to an approach of one-fault tolerance for crew safety and risk-based redundancy to meet probability allocations of loss of mission and loss of crew. This paper will discuss the analyses, rationale, and end results of this activity regarding Orion navigation sensor hardware, control effectors, and trajectory design.
However, the fault tolerance community has ... NASA/Langley Research Center has developed one such fault-tolerant computing platform .... that is locally detectable to all good observers are .... On-line Diagnosis Protocol for the SPIDER ...
Hadoop is an open-source data processing framework that includes a scalable, fault-tolerant distributed file system, HDFS. Although HDFS was designed to work in conjunction with Hadoop's job scheduler, we have re-purposed it to serve as a grid storage element by adding GridFTP and SRM servers. We have tested the system thoroughly in order to understand its scalability and fault tolerance. The turn-on of the Large Hadron Collider (LHC) in 2009 poses a significant data management and storage challenge; we have been working to introduce HDFS as a solution for data storage for one LHC experiment, the Compact Muon Solenoid (CMS).
Animation of algorithms makes understanding them intuitively easier. This paper describes the software tool Raft (Robust Animator of Fault-Tolerant Algorithms). The Raft system allows the user to animate a number of parallel algorithms which achieve fault-tolerant execution. In particular, we use it to illustrate the key Write-All problem. It has an extensive user interface which allows a choice of the number of processors, the number of elements in the Write-All array, and the adversary that controls the processor failures. The novelty of the system is that the interface allows the user to create new on-line adversaries as the algorithm executes.
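For readers unfamiliar with Write-All, the problem asks a set of crash-prone processors to set every cell of an array to 1. The following is a deliberately naive sketch under stated assumptions, not one of the algorithms animated by Raft: each surviving processor sweeps the array from a staggered start, the `adversary` callback is a hypothetical interface that names processors to crash at each step, and the simulation forces at least one processor to survive.

```python
def write_all(n_cells, n_procs, adversary):
    """Simulate a naive Write-All run; return the final array and the total work.

    adversary(step, alive) returns an iterable of processor ids to crash
    before the writes of that step take place.
    """
    array = [0] * n_cells
    alive = set(range(n_procs))
    work = 0    # total number of cell writes performed
    step = 0
    while 0 in array:
        survivors = alive - set(adversary(step, alive))
        if not survivors:                 # assumption: one processor always survives
            survivors = {min(alive)}
        alive = survivors
        for p in alive:
            array[(p + step) % n_cells] = 1   # staggered sweep over the array
            work += 1
        step += 1
    return array, work


if __name__ == "__main__":
    # Adversary that crashes every processor except 0 before the first write.
    def harsh(step, alive):
        return [p for p in alive if p != 0] if step == 0 else []

    print(write_all(8, 4, harsh))
```

Any processor that survives the whole run visits every cell within `n_cells` steps, so the loop terminates; the interest of the real problem lies in doing so with far less total work than this sweep.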
A controller and sensor fault-tolerant system for a steam generator is designed with fuzzy logic. The proposed fault-tolerant redundant system is composed of a supervisor and two fuzzy weighting modulators. The supervisor alternately checks controller-induced and sensor-induced performance to identify which part, a controller or a sensor, is faulty. To analyze controller-induced performance, both the error and the change in error of the system output are chosen as fuzzy variables. The fuzzy logic for sensor-induced performance uses two variables: the deviation between two sensor outputs and its frequency. The fuzzy weighting modulator generates an output signal compensated for the faulty input signal. Simulations show that the proposed fault-tolerant control scheme regulates the steam generator water level well by suppressing the fault effects of either controllers or sensors. Therefore, by duplicating sensors and controllers with the proposed fault-tolerant scheme, the reliability of both the steam generator control and sensor system and of the power plant increases. 2 refs., 9 figs., 1 tab. (Author)
Various strategies for achieving fault tolerance in large-scale control systems are discussed. The positive and negative impacts of distribution through network communication are presented. The ATOMOS framework for standardized reliable marine automation is presented along with the corresponding reliability issues. A generic framework for simulation of network traffic under fault conditions is suggested and the first practical experiences from a prototype implementation are reported.
The prime objective of Fault-tolerant Control (FTC) systems is to handle faults and discrepancies using appropriate accommodation policies. The issue of obtaining information about various parameters and signals, which have to be monitored for fault detection purposes, becomes a rigorous task with the growing number of subsystems. The structural approach, presented in this report, constitutes a general framework for providing information when the system becomes complex. Furthermore, by using this approach, one can determine the calculation sequences of the residuals. The methodology of this approach is illustrated on the ship propulsion benchmark.
The problem of fault-tolerant controller design for a class of polytopic uncertain systems with actuator faults is studied in this paper. The actuator faults are presented as a more general and practical continuous fault model. Based on the affine quadratic stability (AQS), the stability of the polytopic uncertain system is replaced by the stability at all corners of the polytope. For a wide range of problems including Formula Not Shown and mixed Formula Not Shown controller design, sufficient conditions are derived to guarantee the robust stability and performance of the closed-loop system in both normal and fault cases. In the framework of the linear matrix inequality (LMI) method, an iterative algorithm is developed to reduce conservativeness of the design procedure. The effectiveness o...
Today's hardware technology presents a new challenge in designing robust systems. Deep submicron VLSI technology introduced transient and permanent faults that were never considered in low-level system designs in the past. Still, robustness of that part of the system is crucial and needs to be guaranteed for any successful product. Distributed systems, on the other hand, have been dealing with similar issues for decades. However, neither the basic abstractions nor the complexity of contemporary fault-tolerant distributed algorithms match the peculiarities of hardware implementations. This paper is intended to be part of an attempt striving to overcome this gap between theory and practice for the clock synchronization problem. Solving this task sufficiently well will allow to build a very robust high-precision clocking system for hardware designs like systems-on-chips in critical applications. As our first building block, we describe and prove correct a novel Byzantine fault-tolerant self-stabilizing pulse syn...
Dependability analysis is crucial to control the risks resulting from failures in modern industrial systems. This paper proposes a modeling approach that constructs dynamic models of fault-tolerant (FT) systems based on Stochastic Activity Networks (SANs). This approach allows the systematic inclusi...
Fault tolerance is an efficient approach to avoiding or reducing the damage caused by a system failure. In this work we present the results of a fault injection campaign we conducted on the Duplex Framework (DF). The DF is software developed by the UCLA group [1, 2] that uses a fault-tolerant approach and allows two replicas of the same process to run on two different nodes of a commercial off-the-shelf (COTS) computer cluster. A third process, running on a different node, constantly monitors the results computed by the two replicas and restarts the two replica processes if an inconsistency in their computation is detected. This approach is very cost-efficient and can be adopted to control processes on spacecraft where the fault rate produced by cosmic rays is not very high.
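The duplex idea described above (two replicas, a comparing monitor, and a restart on disagreement) can be sketched in a few lines. This is only an illustrative sketch and not the Duplex Framework itself: the names `run_duplex` and `make_flaky`, the restart limit, and the single-bit corruption used as a fault model are all assumptions made for the example.

```python
from typing import Callable, Set, Tuple


def make_flaky(task: Callable[[int], int], faulty_calls: Set[int]) -> Callable[[int], int]:
    """Wrap a task so that selected calls (1-based) return a corrupted result.

    Hypothetical fault injector: it flips the lowest bit of the result.
    """
    calls = {"n": 0}

    def wrapped(x: int) -> int:
        calls["n"] += 1
        result = task(x)
        if calls["n"] in faulty_calls:
            return result ^ 1  # injected single-bit corruption
        return result

    return wrapped


def run_duplex(task: Callable[[int], int], arg: int, max_restarts: int = 3) -> Tuple[int, int]:
    """Run two replicas of the task, compare results, restart both on mismatch.

    Returns the agreed result and the number of restarts that were needed.
    """
    for restarts in range(max_restarts + 1):
        replica_a = task(arg)
        replica_b = task(arg)
        if replica_a == replica_b:  # the monitor accepts only matching results
            return replica_a, restarts
    raise RuntimeError("replicas still disagree after the allowed restarts")


def square(x: int) -> int:
    return x * x


# One corrupted replica run is detected and masked by a single restart.
print(run_duplex(make_flaky(square, {1}), 7))
```

Note that plain comparison cannot detect the case in which both replicas are corrupted in the same way, which is one reason such a scheme suits environments where the fault rate is low.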
We present a constraint logic programming (CLP) approach for synthesis of fault-tolerant hard real-time applications on distributed heterogeneous architectures. We address time-triggered systems, where processes and messages are statically scheduled based on schedule tables. We use process re-execution for recovering from multiple transient faults. We propose three scheduling approaches, (i) full transparency, (ii) slack sharing and (iii) conditional, each of which presents a trade-off between schedule simplicity and performance and provides a different degree of transparency. We have developed a CLP framework that produces the fault-tolerant schedules, guaranteeing schedulability in the presence of transient faults. We show how the framework can be used to tackle design optimization problems. The proposed approach has been evaluated using extensive experiments.
Due to their distributed architecture, artificial neural networks often show a graceful performance degradation to the loss of few units or connections. Living systems also display an additional source of fault-tolerance obtained through distributed processes of self-healing: defective components are actively regenerated. In this paper, we present results obtained with a model of development for spiking neural networks undergoing sustained levels of cell loss. To test their resistance to faults, networks are subjected to random faults during development and mutilated several times during operation. Results show that, evolved to control simulated Khepera robots in a simple navigation task, plastic and non-plastic networks develop fault-tolerant structures which can recover normal operation to various degrees. PMID:16111863
Quantum computers are expected to offer phenomenal increases of computational power. In spite of many proposals based on various physical systems, scalable quantum computation in a fault-tolerant manner is still beyond current technology. Optical models have some prominent advantages such as relatively quick operation time compared to decoherence time. However, massive resource requirements and the gap between the fault tolerance limit and the realistic error rate should be significantly reduced. Here, we develop a novel approach with all-optical hybrid qubits devised to combine advantages of well-known previous approaches. It enables one to efficiently perform universal gate operations in a simple and near-deterministic way using all-optical hybrid entanglement as off-line resources. Remarkably, our approach outperforms the previous ones when considering both the resource requirements and fault tolerance limits. Our work paves an efficient way for the optical realization of scalable quantum computation.
LA-MPI is a high-performance, network-fault-tolerant implementation of MPI designed for terascale clusters that are inherently unreliable due to their very large number of system components and to trade-offs between cost and performance. This paper reviews the architectural design of LA-MPI, focusing on our approach to guaranteeing data integrity. We discuss our network data path abstraction that makes LA-MPI highly portable, gives high performance through message striping, and most importantly provides the basis for network fault tolerance. Finally, we include some performance numbers for the Quadrics and UDP network paths.
A clustered neural network, in which neuronal information is represented by a cluster (population of neurons), rather than a single neuron, is a possible solution to construct fault-tolerant single-electron circuits. We designed single-electron circuits based on a clustered neural network that performs differential enhancement where differences between the cluster's outputs receiving various magnitudes of inputs are enhanced after the processing. Simulation results showed that the degradation of the performance of the clustered single-electron neural network was significantly lower than that of a non-clustered network, which indicates that this approach is one possible way to construct fault-tolerant computing systems on nanodevices.
We describe a single sided matrix converter (SSMC) designed for safety-critical applications such as flight control actuation systems. Dynamic simulations of a multi-phase SSMC using Matlab Simulink are carried out to evaluate its fault tolerance capabilities. Different numbers of phases and power converter topologies are investigated under single-phase open-circuit, single-switch open-circuit, and single-switch short-circuit faults. The simulation results confirm the 5-phase SSMC design as a compromise between fault tolerance and converter size/volume. A 5-phase SSMC prototype was built. Experimental results verify the effectiveness of our design.
With the rise in automation, an increase in fault detection, isolation, and reconfiguration is inevitable. Interest in fault detection and isolation (FDI) for nonlinear systems has grown significantly in recent years. The design of FDI is motivated by the need for knowledge about occurring faults in fault-tolerant control systems (FTC systems). The idea of FTC systems is to detect, isolate, and handle faults in such a way that the systems can still perform in a required manner. One prefers reduced performance after occurrence of a fault to the shutdown of (sub-)systems. Hence, the idea of fault tolerance can be applied to ordinary industrial processes that are not categorized as high-risk applications, but where high availability is desirable. The quality of fault-tolerant control is totally dependent on the quality of the underlying algorithms. They detect possible faults, and later reconfigure control software to handle the effects of the particular fault event. In the past mainly linear FDI methods were developed, but as most industrial plants show nonlinear behavior, nonlinear methods for fault diagnosis could probably perform better. This thesis considers the design of FDI for nonlinear systems. It consists of four different contributions. First, it presents a review of the idea and the theory behind the geometric approach for FDI. Starting from the original solution for linear systems up to the latest results for input-affine systems, the theory and solutions are described. Then the geometric approach is applied to a nonlinear ship propulsion system benchmark. The calculations and application results are presented in detail to give an illustrative example. The obtained subsystems are considered for the design of nonlinear observers in order to obtain FDI. Additionally, an adaptive nonlinear observer design is given for comparison. The simulation results are used to discuss different aspects of the geometric approach, e.g. the possibility to use it as a general approach. The third contribution considers stability analysis of observers used for FDI. It gives proofs of stability for the observers designed for the ship propulsion system. Furthermore, it stresses the importance of the time-variant character of the linearization along a trajectory. It leads to a different stability analysis than for linearization at one operation point. Finally, the preliminary concept of (actuator) fault-output decoupling is described. It is a new idea based on the solution of the input-output decoupling problem. The idea is to include FDI considerations already during the control design.
A methodology for measuring and improving the fault tolerance characteristics of neural networks is presented. Sensitivity analysis and headroom analysis programs have been developed using fault models more realistic and appropriate for emulating hardware failures than those used previously. The potential mode of failure is simulated as a corruption of stored weight and threshold values. These analysis tools enable the fault tolerance characteristics of neural networks to be evaluated. It is demonstrated how functionally identical neural networks can have significantly different reliability characteristics should they be subjected to hardware platform failures. Criteria for selection of globally optimal architecture and trained state are discussed using results provided by the sensitivity and headroom analysis programs. These criteria, combined with empirical results, lead to implied design rules which can be adopted by engineers of neural networks for improving and maximizing the fault tolerance characteristics of the system. A novel modification to the backward error propagation training algorithm is discussed and evaluated on its effectiveness in improving the robustness of the trained network. The modification involves the deliberate injection of a small amount of random white noise on the network's weights and thresholds to expedite and increase the likelihood of a more optimal convergence state occurring for the network from the aspect of fault tolerance. The methodology is demonstrated using an iterative design scenario and is shown to be effective under certain circumstances.
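A small experiment can make the point about functionally identical networks concrete. The sketch below is not the sensitivity or headroom analysis program from the abstract; it assumes a single threshold unit, a uniform-noise corruption of the stored weights and threshold, and a hypothetical `sensitivity` measure defined as the fraction of outputs that change under that corruption.

```python
import random


def neuron(weights, threshold, inputs):
    """Single threshold unit: fires when the weighted sum reaches the threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0


def sensitivity(weights, threshold, samples, magnitude, trials=200, seed=0):
    """Fraction of (trial, sample) pairs whose output changes when every stored
    weight and the threshold are perturbed by uniform noise in
    [-magnitude, +magnitude] (an assumed fault model)."""
    rng = random.Random(seed)
    baseline = [neuron(weights, threshold, s) for s in samples]
    flips = 0
    for _ in range(trials):
        noisy_w = [w + rng.uniform(-magnitude, magnitude) for w in weights]
        noisy_t = threshold + rng.uniform(-magnitude, magnitude)
        flips += sum(neuron(noisy_w, noisy_t, s) != b
                     for s, b in zip(samples, baseline))
    return flips / (trials * len(samples))


SAMPLES = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Two units that both compute logical AND on fault-free hardware.
wide_margin = sensitivity([1.0, 1.0], 1.5, SAMPLES, magnitude=0.1)
narrow_margin = sensitivity([1.0, 1.0], 1.95, SAMPLES, magnitude=0.1)
print(wide_margin, narrow_margin)
```

Both parameter sets implement the same function, but the unit whose threshold sits close to the weighted sum for input (1, 1) changes its output under small corruptions, while the one with a wider margin does not.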
This paper presents an experimental evaluation of a hybrid fault detection and isolation scheme against three successive faults in skew-configured inertial sensors of an unmanned aerial vehicle (UAV). An additional small and low-cost inertial measurement unit is installed with a skewed angle to a primary inertial measurement unit. A parity space method and an in-lane monitoring method are combined to increase system tolerance to the occurrence of multiple successive faults during flight. The first and second faults are detected and isolated by the parity space method. The third fault is detected by the parity space method and isolated by the in-lane monitoring method based on the discrete wavelet transform. Hardware in-the-loop tests and flight experiments with a fixed-wing UAV are perform...
The design and evaluation of highly reliable computer systems is a complex issue. Designers mostly develop such systems based on prior knowledge and experience and occasionally from analytical evaluations of simplified designs. A simulation-based environment called DEPEND, which is especially geared for the design and evaluation of fault-tolerant architectures, is presented. DEPEND is unique in that it exploits the properties of object-oriented programming to provide a flexible framework with which a user can rapidly model and evaluate various fault-tolerant systems. The key features of the DEPEND environment are described, and its capabilities are illustrated with a detailed analysis of a real design. In particular, DEPEND is used to simulate the Unix-based Tandem Integrity fault-tolerance and evaluate how well it handles near-coincident errors caused by correlated and latent faults. Issues such as memory scrubbing, re-integration policies, and workload-dependent repair times, which affect how the system handles near-coincident errors, are also evaluated. Issues such as the method used by DEPEND to simulate error latency and the time acceleration technique that provides enormous simulation speed-up are also discussed. Unlike other simulation-based dependability studies, the use of these approaches and the accuracy of the simulation model are validated by comparing the results of the simulations with measurements obtained from fault injection experiments conducted on a production Tandem Integrity machine.
The management of redundancy in computer systems was studied and guidelines were provided for the development of NASA's fault-tolerant distributed systems. Fault recovery and reconfiguration mechanisms were examined. A theoretical foundation was laid for redundancy management by efficient reconfiguration methods and algorithmic diversity. Algorithms were developed to optimize the resources for embedding of computational graphs of tasks in the system architecture and reconfiguration of these tasks after a failure has occurred. The computational structure represented by a path and the complete binary tree was considered and the mesh and hypercube architectures were targeted for their embeddings. The innovative concept of Hybrid Algorithm Technique was introduced. This new technique provides a mechanism for obtaining fault tolerance while exhibiting improved performance.
This paper develops a fault-tolerant variable sampling control (VSC) scheme for a class of nonlinear networked control systems (NCSs) with time-varying state and random network delays. An uncertain continuous Takagi-Sugeno (T-S) fuzzy system with both state and input varying delays, in the presence of possible actuator faults, is obtained equivalently on the basis of the input delay methodology. A tighter bounding lemma is proposed so as to gain less conservative closed-loop stability criteria. Delay-dependent conditions in terms of linear matrix inequalities are derived for the mode-independent fault-tolerant stabilizing controller of the resulting Markovian network-based system by employing a novel stochastic Lyapunov-Krasovskii (L-K) functional. An illustrative example is simulated to s...
The problem of reliability in high-performance control and in fault-tolerant control is considered in this paper. A feedback controller architecture for high performance and fault tolerance is considered. The architecture is based on the Youla-Jabr-Bongiorno-Kucera (YJBK) parameterization. By using the nominal controller in the architecture as a simple and robust controller, it is possible to use the YJBK transfer function for optimization of the closed-loop performance. This can be done both in connection with normal operation of the system and in connection with faults in the system. The architecture will also allow changing the applied sensors and/or actuators when switching between different controllers. This switching becomes particularly simple for open-loop stable systems.
The paper presents a stepwise procedure to develop a fault-tolerant control system for small satellites. The procedure is illustrated through implementation on the AAUSAT-II spacecraft. As shown, the presented procedure requires expertise from several disciplines that are nevertheless necessary for obtaining a complete and consistent solution.
informal reports representing results of PC-based research and development activities being .... tasks such as diagnosis, student modeling, and task selection. ... class are treated using Petri nets, data are entered specifying the internal states of the ob- ... for system monitoring, reconfiguration, fault tolerance and interactive ...
report evaluates documented schemes for speed sensorless drives, and discusses the trends and tradeoffs ..... filtering approach is its fault tolerance which permits system parameter drifts. Therefore, .... Fuzzy Logic are gaining potential as estimators and controllers for many industrial .....
We are designing, perhaps for the first time, closed-loop fault-tolerant control for uncertain nonlinear systems. Our solution is based on a new algebraic estimation technique of the derivatives of a time signal, which yields good estimates of the unknown parameters and of the residuals, i.e., of ...
This paper deals with power extraction maximization and grid fault tolerance of a Doubly-Fed Induction Generator (DFIG)-based Wind Turbine (WT). These variable speed systems have several advantages over the traditional wind turbine operating methods, such as the reduction of the mechanical stress an...
We propose a technique for discrete controller synthesis, with optimal synthesis on bounded paths, in order to model, design, and optimize fault-tolerant distributed systems, taking into account several criteria (e.g., the execution costs of the tasks and their quality of service). Different combina...
View synchronous communication (VSC) is a paradigm initially proposed by the Isis system, that is well suited to implement fault-tolerant services based on replication. VSC can be seen as an adequate low level semantics on which ordered multicasts and uniform multicasts can easily be implemented. Th...
Simple Linux Utility for Resource Management (SLURM) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters of thousands of nodes. Components include machine status, partition management, job management, scheduling and stream copy modules. This paper presents an overview of the SLURM architecture and functionality.
the development of effective fault tolerance methods and tools. ... allows modeling and prediction of their behavior in a ... launch/landing shock & vibration, and in-space thermal management ... Low power design for 'system on a chip' integration. .... Error rates unacceptable during solar flares ...
Increasing soft error rates for semiconductor devices manufactured in later technologies enforce the use of fault-tolerant techniques such as Roll-back Recovery with Checkpointing (RRC). However, RRC introduces time overhead that increases the completion (execution) time. For non-real-time system...
This book presents papers given at a conference on artificial intelligence, supercomputers, and array processors. Topics considered at the conference included parallel processing, computer architecture, memory devices, algorithms, massively parallel processors, fault-tolerant computers, reliability, communication technology, software technology, expert systems, knowledge bases, and database management.
Apr 1, 2005 ... Lack of fault tolerance and cleanliness tolerance in vent paths, which results in time/labor-consuming ...... Lack of operational margin in dimensional design tolerances that “stack-up” and ...... Dredging Operations. 6. $0.40 ...
Object-oriented fault-tree representation unifies evaluation of reliability and diagnosis of faults. Programming/fault tree described more fully in "Object-Oriented Algorithm For Evaluation Of Fault Trees" (ARC-12731). Augmented fault tree object contains more information than fault tree object used in quantitative analysis of reliability. Additional information needed to diagnose faults in system represented by fault tree.
We show that thresholds for fault-tolerant quantum computation are solely determined by the quality of single-system operations if one allows for d-dimensional systems with $8 \leq d \leq 32$. Each system serves to store one logical qubit and additional auxiliary dimensions are used to create and purify entanglement between systems. Physical, possibly probabilistic two-system operations with error rates up to 2/3 are still tolerable to realize deterministic high quality two-qubit gates on the logical qubits. The achievable error rate is of the same order of magnitude as of the single-system operations. We investigate possible implementations of our scheme for several physical set-ups.
This paper presents a performance analysis of an error sharing system running on multimedia collaboration in situation-aware middleware, using the rule-based SES (System Entity Structure) and DEVS (Discrete Event System Specification) modeling and simulation techniques. In DEVS, a system has a time base, inputs, states, outputs, and functions. The paper proposes an adaptive fault tolerance algorithm and its simulation model in a situation-aware middleware framework using DEVS. An example of a situation-aware environment is illustrated for multimedia collaboration.
The use of self-checking nodes and links for implementing fault-tolerant VLSI multicomputers is proposed. The system consists of a large number of VLSI computers interconnected by high-speed dedicated links. Hardware which performs error detection is combined with system-level protocols which handle error recovery and fault treatment. The self-checking nodes notify the rest of the system when their output is erroneous. In order to achieve high fault coverage, error detection is accomplished by duplication and matching. The critical circuit in this scheme is a comparator, which must not be susceptible to faults which can remain undetected and later mask the failure of the functional modules. With both NMOS and CMOS technologies it is possible to implement a self-testing comparator which will produce an error indication if the comparator incurs any single physical defect. 13 references.
Advances in automation have provided integration of monitoring and control functions to enhance the operator's overview and ability to take remedial actions when faults occur. Automation in plant supervision is technically possible with integrated automation systems as platforms, but new design methods are needed to cope efficiently with the complexity and to ensure that the functionality of a supervisor is correct and consistent. In particular these methods are expected to significantly improve fault tolerance of the designed systems. The purpose of this work is to develop a software module implementing an automated technique for Failure Modes and Effects Analysis (FMEA). This technique is based on the matrix formulation of FMEA for the investigation of failure propagation through a system. As its main result, this technique will provide the design engineer with decision tables for fault handling that show how fault migration can be stopped.
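The matrix formulation mentioned above can be illustrated with Boolean matrices that map failure modes at one level to effects at the next. The sketch below is an assumption-laden toy, not the software module from the abstract: the component names, failure modes, matrices, and the helper names `bool_matvec`, `propagate`, and `decision_table` are all invented for illustration.

```python
def bool_matvec(matrix, vector):
    """Boolean matrix-vector product: output i is true if any active
    input j has matrix[i][j] set."""
    return [any(m and v for m, v in zip(row, vector)) for row in matrix]


def propagate(stages, failure_modes):
    """Push a vector of active failure modes through a chain of stages."""
    effects = list(failure_modes)
    for matrix in stages:
        effects = bool_matvec(matrix, effects)
    return effects


def decision_table(stages, mode_names, effect_names):
    """For each single failure mode, list the end effects it can cause."""
    table = {}
    for j, mode in enumerate(mode_names):
        single = [k == j for k in range(len(mode_names))]
        result = propagate(stages, single)
        table[mode] = [name for name, hit in zip(effect_names, result) if hit]
    return table


# Hypothetical level sensor. Columns: [open_circuit, short_circuit, drift];
# rows: local effects [reading_low, reading_high].
SENSOR = [[1, 0, 1],
          [0, 1, 1]]
# Hypothetical level control loop. Columns: [reading_low, reading_high];
# rows: end effects [tank_overfill, tank_underfill].
LEVEL_LOOP = [[1, 0],
              [0, 1]]

TABLE = decision_table([SENSOR, LEVEL_LOOP],
                       ["open_circuit", "short_circuit", "drift"],
                       ["tank_overfill", "tank_underfill"])
print(TABLE)
```

A table of this kind shows which end effects each failure mode can reach, which is the information a designer needs when deciding where propagation should be stopped.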
The concepts of the continuously reconfiguring flight control system (CRM²FCS) and the impact of its architecture upon fault tolerance and reliability are covered. Some of the topics discussed are continuous reconfiguration, autonomous control, virtual common memory and the fault filter. Continuous reconfiguration is defined. An example is discussed with an explanation of transparent failure. Autonomous control is the scheme for controlling a continually reconfiguring system. The process of volunteering is also discussed. The virtual common memory is the common memory architecture used in the continuously reconfiguring system. Its physical implementation is explained. The fault filter is the method used to detect and deal with faulty processors. The different levels and the types of faults each handles are examined. 1 ref.
In this paper the behavior of assertion-based error detection mechanisms is characterized under faults injected according to a quite general fault model. Assertions based on the knowledge of the application can be very effective at detecting corruption of critical data caused by hardware faults. The main drawbacks of that approach are identified as being the lack of protection of data outside the section covered by assertions, namely during input and output, and the possible incorrect execution of the assertions. To handle those weak points the Robust Assertions technique is proposed, whose effectiveness is shown by extensive fault injection experiments. With this technique a system follows a new failure model, called Fail-Bounded, where with high probability all results produced are either correct or, if wrong, within a certain bound of the correct value, whose exact distance depends on the output assertions used. Any kind of assertion can be considered, from simple likelihood tests to high-coverage assertions such as those used in the Algorithm Based Fault Tolerance paradigm. We claim that this failure model is very useful to describe the behavior of many low-cost fault-tolerant systems that have low hardware and software redundancy, like embedded systems, where cost is a severe restriction, yet full availability is expected.
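The Fail-Bounded idea can be illustrated with a simple output assertion. The sketch below is not the Robust Assertions technique; it only shows, under assumed names (`checked_mean`, `AssertionViolation`) and an assumed likelihood test, how an output assertion bounds the error of any result that is allowed to leave the computation.

```python
class AssertionViolation(Exception):
    """Raised when a computed result fails its output assertion."""


def checked_mean(values, compute=None):
    """Compute the mean of the values and apply a simple likelihood test:
    a correct mean always lies between the smallest and largest input.

    The optional compute argument is a hook for injecting a faulty
    computation in experiments (hypothetical, for illustration only).
    """
    if compute is None:
        compute = lambda v: sum(v) / len(v)
    result = compute(values)
    if not (min(values) <= result <= max(values)):
        raise AssertionViolation(f"result {result} outside input range")
    return result


readings = [1.0, 2.0, 3.0]
print(checked_mean(readings))                         # fault-free run
print(checked_mean(readings, compute=lambda v: 2.5))  # wrong, but within the bound
```

A corrupted result of 2.5 passes the assertion, so the output is wrong but stays within the range of the inputs; a grossly corrupted result such as 100.0 is rejected. The tightness of the bound depends entirely on the assertion chosen.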
This paper examines the problem of introducing advanced forms of fault tolerance via reconfiguration into safety-critical avionic systems. This is required to enable increased availability after fault occurrence in distributed integrated avionic systems (compared to static federated systems). The approach taken is to identify a migration path from current architectures to those that incorporate reconfiguration to a lesser or greater degree. Other challenges identified include change of the development process; incremental and flexible timing and safety analyses; and configurable kernels applicable to safety-critical systems.
As supercomputers become larger and more powerful, they are growing increasingly complex. This is reflected both in the exponentially increasing numbers of components in HPC systems (LLNL is currently installing the 1.6 million core Sequoia system) as well as the wide variety of software and hardware components that a typical system includes. At this scale it becomes infeasible to make each component sufficiently reliable to prevent regular faults somewhere in the system or to account for all possible cross-component interactions. The resulting faults and instability cause HPC applications to crash, perform sub-optimally or even produce erroneous results. As supercomputers continue to approach Exascale performance and full system reliability becomes prohibitively expensive, we will require novel techniques to bridge the gap between the lower reliability provided by hardware systems and users' unchanging need for consistent performance and reliable results. Previous research on HPC system reliability has developed various techniques for tolerating and detecting various types of faults. However, these techniques have seen very limited real applicability because of our poor understanding of how real systems are affected by complex faults such as soft fault-induced bit flips or performance degradations. Prior work on such techniques has had very limited practical utility because it has generally focused on analyzing the behavior of entire software/hardware systems both during normal operation and in the face of faults. Because such behaviors are extremely complex, such studies have only produced coarse behavioral models of limited sets of software/hardware system stacks. Since this provides little insight into the many different system stacks and applications used in practice, this work has had little real-world impact.
My project addresses this problem by developing a modular methodology to analyze the behavior of applications and systems during both normal and faulty operation. By synthesizing models of individual components into whole-system behavior models, my work is making it possible to automatically understand the behavior of arbitrary real-world systems to enable them to tolerate a wide range of system faults. My project is following a multi-pronged research strategy. Section II discusses my work on modeling the behavior of existing applications and systems. Section II.A discusses resilience in the face of soft faults and Section II.B looks at techniques to tolerate performance faults. Finally, Section III presents an alternative approach that studies how a system should be designed from the ground up to make resilience natural and easy.
This thesis considered the development of fault-tolerant control systems. The focus was on the category of automated processes that do not necessarily comprise a high number of identical sensors and actuators to maintain safe operation, but still have a potential for improving immunity to component failures. It is often feasible to increase availability for these control loops by designing the control system to perform on-line detection and reconfiguration in case of faults before the safety system makes a close-down of the process. A general development methodology is given in the thesis that carries the control system designer through the steps necessary to consider fault handling in an early design phase. It was shown how an existing control loop with an interface to the plant-wide control system could be extended with three additional modules to obtain fault tolerance: fault detection and isolation, remedial action decision, and reconfiguration. The integration of these modules in software was considered. The general methodology covered the analysis, design, and implementation of fault-tolerant control systems on an overall level. Two detailed studies were presented, one on fault detection and isolation design and one on design of the decision logic. Two application case studies were used to emphasize practical aspects of both the development methodology and the detailed studies. One was an electro-mechanical actuator in a position control loop for a diesel engine speed governor where the purpose was to avoid a total close-down in case of the most likely faults. The second was a fault-tolerant attitude control system for a micro satellite where the operation of the system is mission critical. The purpose was to avoid hazardous effects from faults and maintain operation if possible.
A method was introduced that, after a systematic examination of possible component failures, enables analysis of the relationship between failures and their consequences for the system's operation. This fault propagation analysis is based on coarse models of the subsystems describing the reaction to faults, as for example a variable being zero, low or high. Examples were given that illustrate how such models can be established by simple means, and yet provide important information when combined into a complete system. A special achievement was a method to determine how control loops behave in case of faults. This is not straightforward as the system behaviour depends on the character of the feedback. One of the detailed studies was the design of the decision logic in fault handling, realized as state-event machines. Guidelines for the design were provided, based on experience from the two case studies. Methods for verifying correct operation of the decision logic were described, where a completeness check against the fault propagation analysis is able to guarantee coverage of all considered faults. The usage of software tools to support the development process was illustrated with an off-the-shelf product for constraint logic solving and state-event machine analysis. The coarse system models and the decision logic were analyzed with the tool-box and it was shown how an easy analysis could be performed to verify correctness and completeness of the fault handling design. Experience from this study highlights requirements for a dedicated software environment for fault-tolerant control systems design. The second detailed study addressed the detection of a fault event and determination of the failed component. A variety of algorithms were compared, based on two fault scenarios in the speed governor actuator setup. One was a position sensor fault and the second was an actuator current fault.
The sensor fault detection was trivial, whereas the actuator fault was more challenging. The study demonstrated that many existing methods have a potential to detect and isolate the two faults, but also that the research field still misses a systematic approach to handle realistic problems such as low sampling rate and nonlinear characteristics of the system. The thesis contributed with methods to detect both faults and specifically with a novel algorithm for the actuator fault detection that is superior in terms of performance and complexity to the other algorithms in the comparative study.
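The idea of decision logic realized as a state-event machine, together with a completeness check against the list of considered faults, can be illustrated with a minimal Python sketch. The state names, fault labels, remedial actions and transition table below are hypothetical and are not taken from the thesis; the sketch only shows how one might check that every considered fault has a transition in a given state.

```python
# Hypothetical fault list, of the kind a fault propagation analysis might produce.
CONSIDERED_FAULTS = {"position_sensor_fault", "actuator_current_fault"}

# Decision logic as a state-event machine:
# (state, event) -> (next state, remedial action). All names are illustrative.
TRANSITIONS = {
    ("normal", "position_sensor_fault"): ("degraded", "switch_to_estimated_position"),
    ("normal", "actuator_current_fault"): ("safe_stop", "disable_actuator"),
    ("degraded", "actuator_current_fault"): ("safe_stop", "disable_actuator"),
}


def uncovered_faults(state, transitions, faults):
    """Return the considered faults for which `state` has no transition."""
    handled = {event for (s, event) in transitions if s == state}
    return faults - handled


def step(state, event, transitions):
    """Advance the machine; an unhandled event keeps the state and yields no action."""
    return transitions.get((state, event), (state, None))
```

In this toy table the `normal` state covers both faults, while the check reports that `degraded` has no transition for `position_sensor_fault`, which is the kind of gap a completeness check is meant to expose.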
A failure modes study was the basis for establishing the protection system requirements and for identifying possible design alternatives or options that may be considered. Components of the protection system are discussed with respect to the redundancy and monitoring needed to assure high reliability, as measured by the probability that the system will respond when needed but will not respond inadvertently. High reliability is achieved by making the system fault-tolerant using redundancy, decision logic, and computer monitoring.
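The two reliability measures named above (responding when needed, and not responding inadvertently) can be made concrete with a small Python sketch of k-out-of-n voting. The voting scheme, the independence of channel failures and the numerical values are assumptions made for illustration; the abstract does not state which redundancy arrangement was used.

```python
from math import comb


def prob_at_least(k, n, p):
    """Probability that at least k of n independent channels fire, each with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def voted_system(k, n, p_respond, p_spurious):
    """Return (P(system responds on demand), P(system responds inadvertently)).

    p_respond:  probability that a single channel responds when a real demand occurs.
    p_spurious: probability that a single channel fires without a demand.
    Both values are assumed inputs, not figures from the source.
    """
    return prob_at_least(k, n, p_respond), prob_at_least(k, n, p_spurious)


# Example: 2-out-of-3 voting with assumed per-channel probabilities.
on_demand, inadvertent = voted_system(2, 3, p_respond=0.99, p_spurious=0.01)
```

Under these assumed figures, 2-out-of-3 voting improves both measures relative to a single channel, which is why voting logic is a common way to combine redundancy with decision logic.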
This progress report describes research towards the design and construction of embedded operating systems for real-time advanced aerospace applications. The applications concerned require reliable operating system support that must accommodate networks of computers. The report addresses the problems of constructing such operating systems, the communications media, reconfiguration, consistency and recovery in a distributed system, and the issues of real-time processing. A discussion is included on suitable theoretical foundations for the use of atomic actions to support fault-tolerance and data consistency in real-time object-based systems. In particular, this report addresses: atomic actions, fault-tolerance, operating system structure, program development, reliability and availability, and networking issues. This document reports the status of various experiments designed and conducted to investigate embedded operating system design issues.
High performance and reliability are required for wind turbines to be competitive within the energy market. To capture their nonlinear behavior, wind turbines are often modeled using parameter-varying models. In this paper we design and compare multiple linear parameter-varying (LPV) controllers, designed using a proposed method that allows the inclusion of both faults and uncertainties in the LPV controller design. We specifically consider a 4.8 MW, variable-speed, variable-pitch wind turbine model with a fault in the pitch system. We propose the design of a nominal controller (NC), handling the parameter variations along the nominal operating trajectory caused by nonlinear aerodynamics. To accommodate the fault in the pitch system, an active fault-tolerant controller (AFTC) and a passive ...
"The fundamental objective of the combined safety and reliability assessment is to identify critical items in the design and the choice of equipment that may jeopardize safety or availability, and thereby to provide arguments for the selection between different options for the system." Achieving safety and reliability has for decades been one of the prime objectives for designers of safety-critical systems. With growing environmental awareness, concerns, and demands, the scope of the design of reliable (and safe) systems has been extended even to small components such as sensors and actuators. In the past, the normal procedure to address the higher demand for reliability was to add hardware redundancy, which in turn increases the production and maintenance costs. Active fault-tolerant design is an attempt to achieve higher redundancy while minimizing the costs. In chapter 2, reliability and safety related issues are considered and described. The idea of introducing this chapter is to provide an overview of the concepts and methods used for reliability and safety assessment. The focus in chapter 3 is on the fault-tolerance concept. Types of possible faults in components and customary methods for applying redundancy are described. Finally, the chapter is wrapped up by considering and describing the main subject, which is a formal and consistent procedure to design active fault-tolerant systems.
Twisted hypercube-like networks (THLNs) are an important class of interconnection networks for parallel computing systems, which include most popular variants of the hypercubes, such as crossed cubes, Möbius cubes, twisted cubes and locally twisted cubes. This paper deals with the fault-tolerant hamiltonian connectivity of THLNs under the large fault model. Let $G$ be an $n$-dimensional THLN and $F \subseteq V(G) \cup E(G)$, where $n \geq 7$ and $|F| \leq 2n - 10$. We prove that for any two nodes $u, v \in V(G - F)$ satisfying a simple necessary condition on neighbors of $u$ and $v$, there exists a hamiltonian or near-hamiltonian path between $u$ and $v$ in $G - F$. The result extends further the fault-tolerant graph embedding capability of THLNs.
Described here are the Army Fault Tolerant Architecture (AFTA) hardware architecture and components and the operating system. The architectural and operational theory of the AFTA Fault Tolerant Data Bus is discussed. The test and maintenance strategy developed for use in fielded AFTA installations is presented. An approach to be used in reducing the probability of AFTA failure due to common mode faults is described. Analytical models for AFTA performance, reliability, availability, life cycle cost, weight, power, and volume are developed. An approach is presented for using VHSIC Hardware Description Language (VHDL) to describe and design AFTA's developmental hardware. A plan is described for verifying and validating key AFTA concepts during the Dem/Val phase. Analytical models and partial mission requirements are used to generate AFTA configurations for the TF/TA/NOE and Ground Vehicle missions.
Research has presented several approaches to achieve varying degrees of fault-tolerance in unmanned aircraft. Approaches in reconfigurable flight control are generally divided into two categories: those which incorporate multiple non-adaptive controllers and switch between them based on the output of a fault detection and identification element, and those that employ a single adaptive controller capable of compensating for a variety of fault modes. Regardless of the approach for reconfigurable flight control, certain fault modes dictate system restructuring in order to prevent a catastrophic failure. System restructuring enables active control of actuation not employed by the nominal system to recover controllability of the aircraft. After system restructuring, continued operation requires the generation of flight paths that adhere to an altered flight envelope. The control architecture developed in this research employs a multi-tiered hierarchy to allow unmanned aircraft to generate and track safe flight paths despite the occurrence of potentially catastrophic faults. The hierarchical architecture increases the level of autonomy of the system by integrating five functionalities with the baseline system: fault detection and identification, active system restructuring, reconfigurable flight control, reconfigurable path planning, and mission adaptation. Fault detection and identification algorithms continually monitor aircraft performance and issue fault declarations. When the severity of a fault exceeds the capability of the baseline flight controller, active system restructuring expands the controllability of the aircraft using unconventional control strategies not exploited by the baseline controller. Each of the reconfigurable flight controllers and the baseline controller employs a proven adaptive neural network control strategy. A reconfigurable path planner employs an adaptive model of the vehicle to re-shape the desired flight path.
Generation of the revised flight path is posed as a linear program constrained by the response of the degraded system. Finally, a mission adaptation component estimates limitations on the closed-loop performance of the aircraft and adjusts the aircraft mission accordingly. A combination of simulation and flight test results using two unmanned helicopters validates the utility of the hierarchical architecture.
Systems are built by connecting different components (e.g., sensors, actuators, process components) that are, in turn, organized to achieve system objectives. But, when a system component fails, the system's objectives can no longer be achieved. For many years, numerous studies have proposed efficient fault detection and isolation (FDI) and fault-tolerant control (FTC) algorithms. This paper considers faults that lead to the complete failure of actuators. In this specific case, the system's physical structure changes, and the system model thus becomes incorrect. The potential that the system has to continue to achieve its objectives has to be re-evaluated from a qualitative point of view, before recalculating or modifying the control algorithms. To this end, this paper proposes a self-upda...
Whereas some applications require correct computation many others do not. A large domain where perfect functional performance is not always required is multimedia and DSP systems. Relaxing the requirement of 100% correctness for devices and interconnections may dramatically reduce costs of manufacturing, verification, and testing. The goal of this paper is to develop a method for trading computational correctness for an additional chip area involved by fault-tolerance implementation. The method is demonstrated for the BP array in the following way: only the most significant bits of the output word are made fault-tolerant. By introducing the concept of partially error-tolerant BP array, designers achieve one more degree of tradeoff freedom. Formal definitions of the proposed terms are given...
We demonstrate the possibility of realizing a neural network in a chain of trapped ions with induced long range interactions. Such models allow information to be stored in a distributed fashion over the whole system. The storage capacity of such a network, which depends on the phonon spectrum of the system, can be controlled by changing the external trapping potential. We analyze the implementation of fault-tolerant universal quantum information processing in such systems.
This paper describes a parallel algorithm for constructing k-valued fault-tolerant diagnostic tests. The algorithm is implemented in the intelligent instrumental software IMSLOG. Fault-tolerance is ensured by applying a sufficient condition for the construction of diagnostic tests that are tolerant to a given number of errors of measurement (entry) values of the characteristic features of the object.
A system for data compression utilizing systolic array architecture for Vector Quantization (VQ) is disclosed for both full-searched and tree-searched VQ. For a tree-searched VQ, the special case of a Binary Tree-Search VQ (BTSVQ) is disclosed with identical Processing Elements (PE) in the array for both a Raw-Codebook VQ (RCVQ) and a Difference-Codebook VQ (DCVQ) algorithm. A fault-tolerant system is disclosed which allows a PE that has developed a fault to be bypassed in the array and replaced by a spare at the end of the array, with codebook memory assignment shifted one PE past the faulty PE of the array.
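The bypass-and-shift scheme described above can be sketched in a few lines of Python. This is an illustrative model only: the function names, the zero-based PE numbering and the single-spare assumption are hypothetical and are not taken from the disclosure.

```python
def physical_pe(logical_stage, faulty_pe=None):
    """Map a logical pipeline stage to a physical PE index.

    Stages before the faulty PE keep their position; stages at or after it
    are shifted one PE further along, so the last stage lands on the spare.
    """
    if faulty_pe is None or logical_stage < faulty_pe:
        return logical_stage
    return logical_stage + 1


def assign_codebooks(codebooks, faulty_pe=None):
    """Return {physical PE index: codebook section} for an array with one spare at the end."""
    return {physical_pe(i, faulty_pe): cb for i, cb in enumerate(codebooks)}


# Three logical stages on four physical PEs (index 3 is the spare).
healthy = assign_codebooks(["c0", "c1", "c2"])
repaired = assign_codebooks(["c0", "c1", "c2"], faulty_pe=1)
```

With PE 1 marked faulty, the mapping skips index 1 and the last codebook section moves onto the spare at index 3, so the order of stages in the pipeline is preserved.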
Research on wind turbine Operations & Maintenance (O&M) procedures is critical to the expansion of Wind Energy Conversion systems (WEC). In order to reduce O&M costs and increase the lifespan of the turbine, we study the application of Set-Valued Observers (SVO) to the problem of Fault Detection and Isolation (FDI) and Fault-Tolerant Control (FTC) of wind turbines, by taking advantage of the recent advances in SVO theory for model invalidation. A simple wind turbine model is presented along with possible faulty scenarios. The FDI algorithm is built on top of the described model, taking into account process disturbances, uncertainty and sensor noise. The FTC strategy takes advantage of the proposed FDI algorithm, enabling the controller reconfiguration shortly after fault events. Additionally, a robust controller is designed so as to increase the wind turbine's performance during low severity faults. Finally, the FDI algorithm is assessed within a publicly available benchmark model, using Monte-Carlo simulation runs.
This paper analyzes the fault-tolerance nature of Evolutionary Algorithms (EAs) when executed in a distributed environment subjected to malicious acts. More precisely, the inherent resilience of EAs against two types of failures is considered: (1) crash faults, typically due to resource volatility which lead to data loss and part of the computation loss; (2) cheating faults, a far more complex kind of fault that can be modeled as the alteration of output values produced by some or all tasks of the program being executed. This last type of failure is due to the presence of cheaters on the computing platform. Most often in Global Computing (GC) systems such as BOINC, cheaters are attracted by the various incentives provided to stimulate the volunteers to share their computing resources: chea...
It is anticipated that self assembled ultra-dense nanomemories will be more susceptible to manufacturing defects and transient faults than conventional CMOS-based memories, thus the need exists for fault-tolerant memory architectures. The development of such architectures will require intense analysis in terms of achievable performance measures - power dissipation, area, delay and reliability. In this paper, we propose and develop a hybrid automation framework, called HMAN, that aids the design and analysis of fault-tolerant architectures for nanomemories. Our framework can analyze memory architectures at two different levels of the design abstraction, namely the system and circuit levels. To the best of our knowledge, this is the first such attempt at analyzing memory systems at different levels of abstraction and then correlating the different performance measures to provide the system designers guidelines for designing a robust nanomemory. We also illustrate the application of our framework to self-assembled crossbar architectures by analyzing a hierarchical fault-tolerant crossbar-based memory architecture that we have developed, and comparing this with existing crossbar architectures.
This paper presents a systematic procedure to achieve fault-tolerant capability for a four-wheel driven, four-wheel steered mobile robot moving in outdoor terrain. The procedure is exemplified throughout the paper by applying it to a compass module. Detailed methods for fault detection and fault accommodation for the compass faults are discussed and the results are verified through field tests.
A logic and proof theory that was mechanized and successfully applied to prove nontrivial properties of a fully distributed fault-tolerant system is described. The system is close to achieving the critical balance in man-machine interaction necessary for successful application by users other than the system developers. Motivation for the form of man-machine interaction embodied by STP is discussed. A formal description of the logic and the proof theory is included, and the description is illustrated with an example. The use of STP in a large scale effort to prove nontrivial properties of SIFT, a distributed Software Implemented Fault Tolerant operating system for aircraft flight control, is discussed. Finally, directions for further work are described.
Building dependable distributed applications is not an easy task. Designers of such systems have followed two complementary approaches to reduce design complexity, namely: i) the use of appropriate development tools; and ii) the choice of the most restrictive failure semantics possible for the components that form the system's underlying execution layer. The Seljuk model uses these two approaches to specify a structured way of providing fault-tolerance services in the context of distributed operating environments, thus facilitating the construction and execution of dependable distributed applications. In this paper we present the design of the Seljuk-Amoeba operating environment, which follows the Seljuk model to enhance the Amoeba distributed operating system with the provision of fault-tolerance services.
This paper addresses the design process of diagnosis and fault-tolerant control when a system should operate despite multiple failures in sensors or actuators. Graph-theory based analysis of system structure is demonstrated to be a unique design methodology that can cope with the diagnosis design for systems of high complexity, and also analyse the cases of cascaded or multiple faults. The paper takes as example a ship with two CP propellers, rudders and a bow thruster as actuators, and instrumentation with a suite of global position sensors, inertial navigation units and conventional gyro units to provide ship motion information. A salient feature of the design method is the ability to analyse cases where faults have occurred and easily determine where in the faulty system diagnosability and controllability are retained.
With ever growing concerns about energy crises and environmental issues, the Proton Exchange Membrane Fuel Cell is favored in automotive applications because it is clean, efficient and low-noise. A fuel cell hybrid powertrain is composed of a fuel cell system as the primary power source, a battery as the secondary power source and an electric motor. In order to improve reliability, a distributed control system that can be preserved even under faulty cases is necessary. This paper presents an active fault-tolerance control system (AFTCS) for a fuel cell/battery hybrid powertrain applied to a city bus. The AFTCS consists of a system for fault detection and diagnosis and a reconfigurable controller. Algorithms to detect and isolate three kinds of important faults are introduced. The real-time ap...
This paper addresses the design process of diagnosis and fault-tolerant control when a system should operate despite multiple failures in sensors or actuators. Graph-theory based analysis of system's structure is demonstrated to be a unique design methodology that can cope with the diagnosis design for systems of high complexity, and also analyse the cases of cascaded or multiple faults. The paper demonstrates the design method on a ship with three actuators: two shafts with CP propellers and a bow thruster, and navigation instruments: global position sensors (GPS), inertial navigation units and conventional gyros to provide ship motion information. A salient feature of the design method is shown to be the ability to analyse cases where one or more faults have occurred and rapidly determine where in the faulty system reconfigurability, diagnosability and controllability are retained.
We show that a category of one-dimensional XY-type models may enable high-fidelity quantum state transmissions, regardless of details of coupling configurations. This observation leads to a fault-tolerant design of a state transmission setup. The setup is fault-tolerant, with specified thresholds, against engineering failures of coupling configurations, fabrication imperfections or defects, and even time-dependent noises. We propose the implementation of the fault-tolerant scheme using hard-core bosons in one-dimensional optical lattices.
In this introductory article on the subject of quantum error correction and fault-tolerant quantum computation, we review three important ingredients that enter known constructions for fault-tolerant quantum computation, namely quantum codes, error discretization and transversal quantum gates. Taken together, they provide a ground on which the theory of quantum error correction can be developed and fault-tolerant quantum information protocols can be built. PMID:22908341
This book presents an overview of the issues related to the test, diagnosis and fault-tolerance of Network on Chip-based systems. It is the first book dedicated to the quality aspects of NoC-based systems and will serve as an invaluable reference to the problems, challenges, solutions, and trade-offs related to designing and implementing state-of-the-art, on-chip communication architectures.
New services based on the best-effort paradigm could complement the current deterministic services of an electronic financial exchange. Four crucial aspects of such systems would benefit from a hybrid stance: proper use of processing resources, bandwidth management, fault-tolerance, and exception handling. We argue that a more refined view on Quality-of-Service control for exchange systems, in which the principal ambition of upholding a fair and orderly marketplace is left uncompromised, would benefit all interested parties.
Cryoelectric device, array, and assembled system considerations are examined in the context of a postulated one million cell (one billion gate) association storing processor (ASP) application. Cryotron and STD (superconducting tunneling device based on the dc Josephson effect) devices are compared. Arguments based on practical array complexity, interconnection, and the necessary fault-tolerant provisions are cited. Prerequisites to subsequent development of high complexity cryoelectric systems are apparent.
An electromechanical ac-powered rotary actuated four-bar linkage system for rotating the Shuttle/Centaur deployment adapter is described. The essential features of the deployment adapter rotation system (DARS) are increased reliability for mission success and maximum practical hazard control for safety. The requirements, concept development, hardware configuration, quality assurance provisions, and techniques used to meet two-fault-tolerance requirements are highlighted. The rationale used to achieve a degree of safety equivalent to that of two-failure tolerance is presented. Conditions that make this approach acceptable, including single failure point components with regard to redundancy versus credibility of failure modes, are also discussed.
The Data Management System network is a complex and important part of manned space platforms. ... include fault detection, fault diagnosis, fault recovery, fault prevention and ... case, the AI subsystem is an advisor to a network designer.
The matrix converter (MC) presents a promising topology that will have to overcome certain barriers (protection systems, durability, the development of converters for real applications, etc.) in order to gain a foothold in the industry. In some applications, where continuous operation must be ensured in the case of a system failure, improved reliability of the converter is of particular importance. In this sense, this article focuses on the study of a fault-tolerant MC. The fault-tolerance of a converter is characterized by its total or partial response in the case of a breakage of any of its components. Taking into consideration that virtually no work has been done on fault-tolerant MCs, this paper describes the most important studies in this area. Moreover, a new method is proposed for detecting the breakage of MC semiconductors. Likewise, a new variation of SVM modulation with failure tolerance capacity is presented. This guarantees the continuous operation of the converter and the pseudo-optimum control of a PMSM. This paper also proposes a novel MC topology, which allows the flexible reconfiguration of this converter when one or several of its semiconductors are damaged. The MC can thus continue operating at 100% of its performance without having to double its resources. The solution described in this article therefore represents a step forward towards the development of reliable matrix converters for real applications.
The advancement of information processing into the realm of quantum mechanics promises a transcendence in computational power that will enable problems to be solved which are completely beyond the known abilities of any "classical" computer, including any potential non-quantum technologies the future may bring. However, the fragility of quantum states poses a challenging obstacle for realization of a fault-tolerant quantum computer. The topological approach to quantum computation proposes to surmount this obstacle by using special physical systems -- non-Abelian topologically ordered phases of matter -- that would provide intrinsic fault-tolerance at the hardware level. The so-called "Ising-type" non-Abelian topological order is likely to be physically realized in a number of systems, but it can only provide a universal gate set (a requisite for quantum computation) if one has the ability to perform certain dynamical topology-changing operations on the system. Until now, practical methods of implementing thes...
In this paper we present a new technology that we are currently developing within the SFT: Scalable Fault Tolerance FastOS project, which seeks to implement fault-tolerance at the operating system level. Major design goals include dynamic reallocation of resources to allow continuing execution in the presence of hardware failures, very high scalability, high efficiency (low overhead), and transparency, requiring no changes to user applications. Our technology is based on a global coordination mechanism that enforces transparent recovery lines in the system, and TICK, a lightweight, incremental checkpointing software architecture implemented as a Linux kernel module. TICK is completely user-transparent and does not require any changes to user code or system libraries; it is highly responsive: an interrupt, such as a timer interrupt, can trigger a checkpoint in as little as 2.5 µs; and it supports incremental and full checkpoints with minimal overhead: less than 6% with full checkpointing to disk performed as frequently as once per minute.
The Legion project at the University of Virginia is an architecture for designing and building system services that provide the illusion of a single virtual machine to users, a virtual machine that provides secure shared object and shared name spaces, application adjustable fault-tolerance, improved response time, and greater throughput. Legion targets wide area assemblies of workstations, supercomputers, and parallel supercomputers. Legion tackles problems not solved by existing workstation based parallel processing tools; the system will enable fault-tolerance, wide area parallel processing, inter-operability, heterogeneity, a single global name space, protection, security, efficient scheduling, and comprehensive resource management. This paper describes the core Legion object model, which specifies the composition and functionality of Legion's core objects: those objects that cooperate to create, locate, manage, and remove objects in the Legion system. The object model facilitates a flexible extensible implementation, provides a single global name space, grants site autonomy to participating organizations, and scales to millions of sites and trillions of objects.
In this paper, we explore the benefits and possibilities of implementing a multi-agent simulation framework on a Hadoop cloud. Scalability, fault-tolerance and failure-recovery have always been a challenge for distributed systems application developers. The highly efficient fault-tolerant nature of Hadoop, the flexibility to include more systems on the fly, efficient load balancing and the platform-independent Java are useful features for the development of any distributed simulation. In this paper, we propose a framework for an agent simulation environment built on a Hadoop cloud. Specifically, we show how agents are represented, how agents do their computation and communication, and how agents are mapped to datanodes. Further, we demonstrate that even if some of the systems fail in the di...
Simple Linux Utility for Resource Management (SLURM) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters of thousands of nodes. Components include machine status, partition management, job management, and scheduling modules. The design also includes a scalable, general-purpose communication infrastructure. Development will take place in four phases: Phase I results in a solid infrastructure; Phase II produces a functional but limited interactive job initiation capability without use of the interconnect/switch; Phase III provides switch support and documentation; Phase IV provides job status, fault-tolerance, and job queuing and control through Livermore's Distributed Production Control System (DPCS), a meta-batch and resource management system.
Grid superscheduling requires support for efficient and scalable discovery of resources. Resource discovery activities involve searching for the appropriate resource types that match the user's job requirements. To accomplish this goal, a resource discovery system that supports the desired look-up operation is mandatory. Various kinds of solutions to this problem have been suggested, including the centralised and hierarchical information server approach. However, both of these approaches have serious limitations in regards to scalability, fault-tolerance and network congestion. To overcome these limitations, organising resource information using Peer-to-Peer (P2P) network model has been proposed. Existing approaches advocate an extension to structured P2P protocols, to support the Grid resource information system (GRIS). In this paper, we identify issues related to the design of such an efficient, scalable, fault-tolerant, consistent and practical GRIS system using a P2P network model. We compile these issues...
The Low Latency Fault Tolerance (LLFT) system provides fault-tolerance for distributed applications, using the leader-follower replication technique. The LLFT system provides application-transparent replication, with strong replica consistency, for applications that involve multiple interacting processes or threads. The LLFT system comprises a Low Latency Messaging Protocol, a Leader-Determined Membership Protocol, and a Virtual Determinizer Framework. The Low Latency Messaging Protocol provides reliable, totally ordered message delivery by employing a direct group-to-group multicast, where the ordering is determined by the primary replica in the group. The Leader-Determined Membership Protocol provides reconfiguration and recovery when a replica becomes faulty and when a replica joins or leaves a group, where the membership is determined by the primary replica. The Virtual Determinizer Framework captures the ordering information at the primary replica in the group and enforces the same ordering at the backup...
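The general idea of a primary replica determining the delivery order, with backups enforcing that same order, can be sketched as follows. This is a simplified, single-process Python illustration and not the LLFT protocols themselves; the class names and the sequence-number scheme are assumptions, and message loss, retransmission and membership changes are left out.

```python
class Primary:
    """Assigns a total order by stamping each message with a sequence number."""

    def __init__(self):
        self.next_seq = 0

    def order(self, message):
        stamped = (self.next_seq, message)
        self.next_seq += 1
        return stamped


class Backup:
    """Delivers messages strictly in the order chosen by the primary."""

    def __init__(self):
        self.expected = 0      # next sequence number to deliver
        self.pending = {}      # messages that arrived ahead of their turn
        self.delivered = []    # messages delivered so far, in order

    def receive(self, stamped):
        seq, message = stamped
        self.pending[seq] = message
        # Deliver every consecutive message that is now available.
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1
```

A backup that receives the stamped messages out of order buffers them and still ends up with the same delivery sequence as the primary, which is the property that strong replica consistency relies on.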
Next-generation exascale systems, those capable of performing a quintillion ($10^{18}$) operations per second, are expected to be delivered in the next 8-10 years. These systems, which will be 1,000 times faster than current systems, will be of unprecedented scale. As these systems continue to grow in size, faults will become increasingly common, even over the course of small calculations. Therefore, issues such as fault-tolerance and reliability will limit application scalability. Current techniques to ensure progress across faults like checkpoint/restart, the dominant fault-tolerance mechanism for the last 25 years, are increasingly problematic at the scales of future systems due to their excessive overheads. In this work, we evaluate a number of techniques to decrease the overhead of checkpoint/restart and keep this method viable for future exascale systems. More specifically, this work evaluates state-machine replication to dramatically increase the checkpoint interval (the time between successive checkpoints) and hash-based, probabilistic incremental checkpointing using graphics processing units to decrease the checkpoint commit time (the time to save one checkpoint). Using a combination of empirical analysis, modeling, and simulation, we study the costs and benefits of these approaches on a wide range of parameters. These results, which cover a number of high-performance computing capability workloads, different failure distributions, hardware mean times to failure, and I/O bandwidths, show the potential benefits of these techniques for meeting the reliability demands of future exascale platforms.
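Hash-based incremental checkpointing, as mentioned above, saves only the blocks of memory whose hash has changed since the previous checkpoint. The Python sketch below illustrates the principle on a byte string using CPU-side SHA-256 from the standard library; the block size, function names and the choice of hash are assumptions for illustration, and the work summarized here computes hashes on graphics processing units instead.

```python
import hashlib


def block_hashes(data, block_size):
    """Hash each fixed-size block of `data`."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def incremental_checkpoint(data, previous_hashes, block_size=4096):
    """Return (changed_blocks, new_hashes).

    changed_blocks maps block index -> block bytes for every block whose hash
    differs from the previous checkpoint (or that did not exist before).
    The scheme is probabilistic: a hash collision would hide a changed block.
    """
    new_hashes = block_hashes(data, block_size)
    changed = {}
    for index, digest in enumerate(new_hashes):
        if index >= len(previous_hashes) or previous_hashes[index] != digest:
            start = index * block_size
            changed[index] = data[start:start + block_size]
    return changed, new_hashes


# First checkpoint saves everything; the second saves only the modified block.
state = bytes(16)
first, hashes = incremental_checkpoint(state, [], block_size=4)
state = bytes(4) + b"\x01\x00\x00\x00" + bytes(8)
second, hashes = incremental_checkpoint(state, hashes, block_size=4)
```

Because only changed blocks are written, the commit time scales with the amount of modified state rather than with the total memory footprint.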
Taichung Veterans General Hospital has been developing a hospital-wide picture archiving and communication system (PACS) since 1993. A personal computer-based environment was implemented to reduce costs (only $2,500 for each view station) and take advantage of distributed system techniques. Other features of the PACS are automatic image acquisition, hierarchic storage management, efficient image transmission, robust fault tolerance, and user-friendly image manipulation. The system is integrated with the hospital information system so that Chinese-language patient data can be automatically transferred. A four-tier storage hierarchy and a multipath search strategy are used to improve reliability and efficiency. Image compression and efficient image transmission techniques (autorouting and prefetching) are used to reduce the response time. Robust fault tolerance is achieved with fault-tolerant hardware, image replication, and a system watchdog. User-friendly image manipulation features include easy adjustment of the brightness, contrast, or quality of the displayed image; several windows for image display; and image measurement capability. The PACS currently supports computed tomography, ultrasound, magnetic resonance imaging, computed radiography, and digital fluoroscopy; almost all appropriate personal computers in the hospital can be used as view stations. Users are satisfied with the quality, reliability, and performance of the system. PMID:10194793
Autonomous multiple spacecraft formation flying space missions demand the development of reliable control systems to ensure rapid, accurate, and effective response to various attitude and formation reconfiguration commands. Keeping in mind the complexities involved in the technology development to enable spacecraft formation flying, this thesis presents the development and validation of a fault-tolerant control algorithm that augments the AOCS on board a spacecraft to ensure that these challenging formation flying missions will fly successfully. Taking inspiration from the existing theory of nonlinear control, a fault-tolerant control system for the RyePicoSat missions is designed to cope with actuator faults whilst maintaining the desirable degree of overall stability and performance. An autonomous fault-tolerant adaptive control scheme for spacecraft equipped with redundant actuators and robust control of spacecraft in underactuated configuration represent the two central themes of this thesis. The developed algorithms are validated using a hardware-in-the-loop simulation. A reaction wheel testbed is used to validate the proposed fault-tolerant attitude control scheme. A spacecraft formation flying experimental testbed is used to verify the performance of the proposed robust control scheme for underactuated spacecraft configurations. The proposed underactuated formation flying concept leads to more than 60% savings in fuel consumption when compared to a fully actuated spacecraft formation configuration. We also developed a novel attitude control methodology that requires only a single thruster to stabilize the three-axis attitude and angular velocity components of a spacecraft. Numerical simulations and hardware-in-the-loop experimental results, along with rigorous analytical stability analysis, show that the proposed methodology will greatly enhance the reliability of the spacecraft, while allowing for potentially significant overall mission cost reduction.
The position sensors used in a magnetic bearing system should provide some degree of fault tolerance, as the rotor position is necessary for the feedback control to overcome the open-loop instability. In this paper, we propose an inductive position sensor that can cope with a partial fault in the sensor. The sensor has multiple poles which can be combined to sense the in-plane motion of the rotor. When a high-frequency voltage signal drives each pole of the sensor, the resulting current in the sensor coil contains information regarding the rotor position. The signal processing circuit of the sensor extracts this position information. In this paper, we use a magnetic circuit model of the sensor that shows the analytical relationship between the sensor output and the rotor motion. The multi-polar structure of the sensor makes it possible to introduce redundancy which can be exploited for fault-tolerant operation. The proposed sensor is applied to a magnetically levitated turbo-molecular vacuum pump. Experimental results validate the fault-tolerance algorithm.
The NASA Glenn Research Center, partner universities, and defense contractors are working to develop intelligent power management and distribution (PMAD) technologies for future spacecraft and launch vehicles. The goals are to provide higher performance (efficiency, transient response, and stability), higher fault tolerance, and higher reliability through the application of digital control and communication technologies. It is also expected that these technologies will eventually reduce the design, development, manufacturing, and integration costs for large, electrical power systems for space vehicles. The main focus of this research has been to incorporate digital control, communications, and intelligent algorithms into power electronic devices such as direct-current to direct-current (dc-dc) converters and protective switchgear. These technologies, in turn, will enable revolutionary changes in the way electrical power systems are designed, developed, configured, and integrated in aerospace vehicles and satellites. Initial successes in integrating modern, digital controllers have proven that transient response performance can be improved using advanced nonlinear control algorithms. One technology being developed includes the detection of "soft faults," those not typically covered by current systems in use today. Soft faults include arcing faults, corona discharge faults, and undetected leakage currents. Using digital control and advanced signal analysis algorithms, we have shown that it is possible to reliably detect arcing faults in high-voltage dc power distribution systems (see the preceding photograph). Another research effort has shown that low-level leakage faults and cable degradation can be detected by analyzing power system parameters over time. This additional fault detection capability will result in higher reliability for long-lived power systems such as reusable launch vehicles and space exploration missions.
Latency, fault tolerance and reliability are important requirements for several applications that are time critical in nature: such applications require guarantees in terms of latency, even when processors are subject to failures. In this paper, we propose a fault-tolerant scheduling heuristic for mapping precedence task graphs on heterogeneous systems. Our approach is based on an active replication scheme, capable of supporting ε arbitrary fail-silent/fail-stop processor failures, and hence valid results will be provided even if ε processors fail. First we focus on a bi-criteria approach, where we aim at minimizing the latency given a fixed number of failures supported in the system, or the other way round. Next we derive a more complex algorithm in which we not only minimize latency and ...
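The core idea of active replication in this abstract can be sketched for a single task. This is a simplified illustration, not the paper's heuristic: it ignores precedence constraints and communication costs, and the function names, the `eps` parameter (the number of tolerated fail-stop processor failures), and the earliest-finish-time placement rule are assumptions. Each task runs on `eps + 1` distinct processors, so at least one replica survives any `eps` failures.

```python
def place_replicas(exec_time: dict, ready_time: dict, eps: int) -> dict:
    """Pick eps + 1 distinct processors with the earliest finish times.

    exec_time[p]  : execution time of the task on processor p (heterogeneous)
    ready_time[p] : time at which processor p becomes free
    Returns a mapping {processor: finish_time} for the chosen replicas.
    """
    if eps + 1 > len(exec_time):
        raise ValueError("need at least eps + 1 processors")
    finish = {p: ready_time[p] + exec_time[p] for p in exec_time}
    # Sort by finish time, breaking ties by processor name for determinism.
    chosen = sorted(finish, key=lambda p: (finish[p], p))[:eps + 1]
    return {p: finish[p] for p in chosen}


def worst_case_latency(placement: dict) -> float:
    """Latency if only the slowest replica survives."""
    return max(placement.values())


def survives(placement: dict, failed: set) -> bool:
    """A result is produced if at least one replica's processor is alive."""
    return any(p not in failed for p in placement)
```

With three processors and `eps = 1`, the task is placed on two of them. Any single failure still leaves one replica running, while a failure of both chosen processors exceeds the supported fault count.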
This paper discusses on-going work with the Integrated Plasma Simulator (IPS), a framework for coupled multiphysics simulations of plasmas, to allow simulations to run through the loss of nodes on which the simulation is executing. While many different techniques are available to improve the fault tolerance of computational science applications on high-performance computer systems, checkpoint/restart (C/R) remains virtually the only one that sees widespread use in practice. Our focus here is to augment the traditional C/R approach with additional techniques that can provide a more localized and tailored response to faults based on the ability to restart failed tasks on an individual basis, and the use of information external to the application itself in order to guide decision-making, in many cases avoiding the need to stop and restart the entire simulation. This capability involves several features within the IPS framework, and leverages the Fault Tolerance Backplane, a publish/subscribe event service to disseminate fault-related information throughout HPC systems, to obtain information from the Reliability, Availability and Serviceability (RAS) subsystem of the HPC system. This work is described in the context of Cray XT-series computer systems for concreteness, but is applicable to other environments as well. As part of the analysis of this work, we discuss the requirements to generalize this approach to other complex simulation applications beyond the Integrated Plasma Simulator.
Faults in wiring systems are a serious concern for the aerospace and aeronautic (commercial, military, and civilian) industries. Circuit failures and vehicle accidents have occurred and have been attributed to faulty wiring created by open and/or short circuits. Often, such circuit failures occur due to vibration during vehicle launch or operation. Therefore, developing non-intrusive fault-tolerant techniques is necessary to detect circuit faults and automatically route signals through alternate recovery paths while the vehicle or lunar surface systems equipment is in operation. Electrical connector concepts combining dust mitigation strategies and cable diagnostic technologies have significant application for lunar and Martian surface systems, as well as for dusty terrestrial applications. The dust-tolerant intelligent electrical connection system has several novel concepts and unique features. It combines intelligent cable diagnostics (health monitoring) and automatic circuit routing capabilities into a dust-tolerant electrical umbilical. It retrofits a clamshell protective dust cover to an existing connector for reduced gravity operation, and features a universal connector housing with three styles of dust protection: inverted cap, rotating cap, and clamshell. It uses a self-healing membrane as a dust barrier for electrical connectors where required, while also combining lotus leaf technology for applications where a dust-resistant coating providing low surface tension is needed to mitigate Van der Waals forces, thereby disallowing dust particle adhesion to connector surfaces. It also permits using a ruggedized iris mechanism with an embedded electrodynamic dust shield as a dust barrier for electrical connectors where required.
This paper describes a concept for fault-tolerant controllers (FTC) based on the YJBK (after Youla, Jabr, Bongiorno and Kucera) parameterization. This controller architecture allows the controller to be changed on-line in the case of faults in the system. In the described FTC concept, a safe mode controller is applied as the basic feedback controller. A controller for normal operation with high performance is obtained by including certain YJBK parameters (transfer functions) in the controller. This allows a fast switch from normal operation to safe mode operation in case of critical faults in the system. The described FTC architecture allows the different feedback controllers to apply different sets of sensors and actuators.
Performing dependability evaluation along with other analyses at architectural level allows both making architectural tradeoffs and predicting the effects of architectural decisions on the dependability of an application. This paper gives guidelines for building architectural dependability models for software systems using the AADL (Architecture Analysis and Design Language). It presents reusable modeling patterns for fault-tolerant applications and shows how the presented patterns can be used in the context of a subsystem of a real-life application.
We present the Telecommunications protocol processing subsystem using Reconfigurable Interoperable Gate Arrays (TRIGA), a novel approach that unifies fault tolerance, error correction coding and interplanetary communication protocol off-loading to implement CCSDS File Delivery Protocol and Datalink layers. The new reconfigurable architecture offers more than one order of magnitude throughput increase while reducing footprint requirements in memory, command and data handling processor utilization, communication system interconnects and power consumption.
This paper describes the application of modeling and analysis techniques to software that is designed to execute on a four-channel version of the Charles Stark Draper Laboratory (CSDL) Fault-Tolerant Processor, referred to as the Draper FTP. The software performs sensor validation of four independent measures (signals) from the primary pumps of the Experimental Breeder Reactor-II operated by Argonne National Laboratory-West, and from the validated signals formulates a flow trip signal for the reactor safety system. 11 refs., 4 figs.
We rigorously study geometric phase gates driven by a classical fluctuating magnetic field. To derive the stochastic Liouville equation of this system, we use the cumulant expansion method of Kubo and the white-noise assumption. The directional dependence of Berry and Aharonov-Anandan (AA) phase gates on noise is captured. For nonadiabatic evolution, increasing the angular frequency of the rotating magnetic field can help realize fault-tolerant AA phase gates.
Scalable quantum computation in realistic devices requires that precise control can be implemented efficiently in the presence of decoherence and operational errors. We propose a constructive procedure for designing robust unitary gates on an open quantum system without encoding or measurement overhead. Our results allow for a low-level error correction strategy solely based on Hamiltonian control under realistic constraints and may substantially reduce implementation requirements for fault-tolerant quantum computing architectures.
Under a U.S. Department of Energy program for radioisotope power systems, Lockheed Martin is developing an Engineering Unit of the Advanced Stirling Radioisotope Generator (ASRG). This is an advanced version of the previously reported SRG110 generator. The ASRG uses Advanced Stirling Convertors (ASCs) developed by Sunpower Incorporated under a NASA Research Announcement contract. The ASRG makes use of a Stirling controller based on power electronics that eliminates the tuning capacitors. The power electronics controller synchronizes dual-opposed convertors and maintains a fixed frequency operating point. The controller is single-fault tolerant and uses high-frequency pulse width modulation to create the sinusoidal currents that are nearly in phase with the piston velocity, eliminating the need for large series tuning capacitors. Sunpower supports this effort through an extension of their controller development intended for other applications. Glenn Research Center (GRC) supports this effort through system dynamic modeling, analysis and test support. The ASRG design arrived at a new baseline based on a system-level trade study and extensive feedback from mission planners on the necessity of single-fault tolerance. This paper presents the baseline design with an emphasis on the power electronics controller detailed design concept that will meet space mission requirements including single-fault tolerance.
Cost drivers in commercial orchards are time-consuming tasks such as driving through rows for spraying, cutting grass or collecting fruit. An automated tractor can be an answer to enhance production efficiency. For this to be acceptable to the public and authorities, safety and reliability are crucial, hence information redundancy is needed to achieve a fault-tolerant system. This paper addresses ways to extract information from laser scanner data. A Gaussian mixture model is used to classify laser data into obstacles, while through diagnosis, a stochastic automaton model gives a semantic position estimate relying only on laser perception. Results demonstrate the feasibility of implementation in an autonomous tractor that uses diagnosis and active fault-tolerant control to enhance availability and safety.
A comparative analysis of distinct multiplex and fault-tolerant configurations for a PLC-based safety system is presented from a reliability point of view. It considers simplex, duplex and fault-tolerant triple redundancy configurations. The standby unit in a duplex configuration has a failure rate which is k times the failure rate of the main unit, with the value of k varying from 0 to 1. For distinct values of MTTR and MTTF of the main unit, MTBF and availability for these configurations are calculated. The effect on the MTBF of the configuration of duplexing only the PLC module, or only the sensors and the actuators module, is also presented. The results are summarized, and the merits and demerits of the various configurations under distinct environments are discussed.
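The quantities compared in this abstract can be illustrated with textbook reliability formulas. The sketch below is not the paper's model. It assumes constant failure rates, perfect switching and voting, and no repair when computing mean time to failure, and the function names are hypothetical. The failure rate of the main unit is `lam`, and the standby unit fails at `k * lam` while idle, with `k` between 0 (cold standby) and 1 (hot standby).

```python
def mttf_simplex(lam: float) -> float:
    """Single unit with constant failure rate lam."""
    return 1.0 / lam


def mttf_duplex_standby(lam: float, k: float) -> float:
    """Main unit plus a standby that fails at rate k * lam while idle.

    Time to the first failure of either unit is 1 / ((1 + k) * lam);
    the surviving unit then runs alone for an expected 1 / lam.
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie between 0 and 1")
    return 1.0 / ((1.0 + k) * lam) + 1.0 / lam


def mttf_tmr(lam: float) -> float:
    """Triple modular redundancy (2-out-of-3) with a perfect voter."""
    return 5.0 / (6.0 * lam)


def availability(mttf: float, mttr: float) -> float:
    """Steady-state availability of a repairable unit."""
    return mttf / (mttf + mttr)
```

Under these assumptions, a cold standby (`k = 0`) doubles the simplex mean time to failure, a hot standby (`k = 1`) gives 1.5 times the simplex value, and triple redundancy without repair has a shorter mean time to failure than a single unit even though it is more reliable over short mission times.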
The James Webb Space Telescope (JWST) secondary mirror and eighteen primary mirror segments are each actively controlled in rigid body position via six hexapod actuators. The mirrors are stowed to the mirror support structure to survive the launch environment and then must be deployed 12.5 mm to reach the nominally deployed position before the Wavefront Sensing & Control (WFS&C) alignment and phasing process begins. The actuation system is electrically, but not mechanically, redundant. Therefore, with the large number of hexapod actuators, the fault tolerance of the OTE architecture and WFS&C alignment process has been carefully considered. The details of the fault tolerance will be discussed, including motor life budgeting, failure signatures, and motor life.
The authors overview, characterize, and classify some typical reconfiguration schemes in light of a proposed taxonomy. This taxonomy can be used as a guide for future research in design and analysis of reconfiguration schemes. Studying how to evaluate fault-tolerant arrays and how to exploit application characteristics to achieve dependable computing are important complementary directions of research towards reliable processor-array design. A related research problem is that of functional reconfiguration, that is, learning how to configure the topology of a parallel system to implement a different function or run a different application. Important directions of research include how to apply or extend processor-array reconfiguration algorithms to other topologies and how to marry functional and fault-tolerance reconfiguration requirements and solutions. The Diogenes approach discussed in this article is a case where this goal is naturally achieved.
A fault-tolerant programmable logic controller (PLC) and operator workstations have been programmed to replace the hard-wired relay control system in the 2-MW Bulk Shielding Reactor (BSR) at Oak Ridge National Laboratory. In addition to the PLC and remote and local operator workstations, auxiliary systems for remote operation include a video system, an intercom system, and a fiber optic communication system. The remote control station, located at the High Flux Isotope Reactor 2.5 km from the BSR, has the capability of reactor startup and power control. The system was designed with reliability and fail-safe features as important considerations. 4 refs., 3 figs.
The two LHC beam dump kicker systems consist each of 14 pulse generator and magnet subsystems. Their task is to extract on request the beams in synchronisation with the gap in the beam. This operation must be fail-safe to avoid disastrous consequences due to loss of the beam inside the LHC. Only a failing operation of one of the 14 pulse generators is allowed. To preserve this tolerance premature beam dumps are forced immediately after early detection of internal faults. However, these faults should occur rarely in order not to be a source of undesirable downtime of the LHC. The report determines first the level of reliability required for the main components of the system. In particular faults which could cause spontaneously non-synchronised beam dumps are identified. Then, technical solutions are evaluated on failure behaviour. Those having a most likely failure mode which does not cause dump triggers are favoured. These solutions need redundancy and are more complex but have the advantage to be faulttoler...
Precise and reliable navigation is crucial, and for reasons of safety, essential navigation instruments are often duplicated. Hardware redundancy is mostly used to manually switch between instruments should faults occur. In contrast, diagnostic methods are available that can use analytic redundancy to diagnose faults and autonomously provide valid navigation data, disregarding any faulty sensor data and using sensor fusion to obtain a best estimate for users. This paper discusses how diagnostic and fault-tolerant methods are applicable in marine systems. The example chosen is sensor fusion for navigation. Diagnosis design is based on parity relations and statistical hypothesis tests. Sensor fusion on healthy signals is performed using a Kalman filter with inverse covariance updating to deal with asynchronous or missing data from instruments. The paper is presented at a tutorial level.
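Inverse covariance updating is convenient for missing data because each healthy measurement simply adds its information to a running sum. The sketch below shows a single measurement update for a scalar state that every instrument is assumed to measure directly. The function name and the use of `None` to mark a missing or rejected reading are assumptions for illustration, not details taken from the paper.

```python
def fuse(prior_mean: float, prior_var: float, measurements: list):
    """Information-form measurement update for a scalar state.

    measurements is a list of (value, variance) pairs, with None in place
    of any reading that is missing or was rejected by fault diagnosis.
    Returns the fused (mean, variance).
    """
    info = 1.0 / prior_var               # inverse covariance
    info_state = prior_mean / prior_var  # information state
    for reading in measurements:
        if reading is None:
            continue  # nothing to add for a missing or faulty sensor
        value, variance = reading
        info += 1.0 / variance
        info_state += value / variance
    return info_state / info, 1.0 / info
```

For example, `fuse(10.0, 4.0, [(12.0, 4.0), None, (14.0, 2.0)])` combines the prior with the two available readings and ignores the missing one. If every reading is `None`, the prior is returned unchanged.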
Resolver sensor based angular position and speed sensing is extensively used in safety-critical servo applications that demand accurate as well as high-resolution position and speed information for feedback control. In this paper, a novel scheme for position and speed sensing along with fault detection and identification of a resolver sensor with systematic errors like magnitude imbalance, imperfect quadrature, and inductive harmonics is presented. The proposed scheme of resolver-to-digital (R/D) conversion mitigates the errors in position and speed estimate due to these common resolver imperfections and provides fault indicators such as good resolver signal, degradation of signal, and loss of signal for fault-tolerant operation and diagnosis of malfunctions in the sensor system for saf...
This paper presents a fault-tolerant duplex architecture to build a high-reliability microcontroller using commercial VLSI processors. The architecture supports fail-silence under all single-failure situations and facilitates recovery from transient failures. The paper implements the duplex architecture using two Motorola MC68360 processors and evaluates its fault tolerance in a real application environment. (author). 12 refs., 10 figs., 2 tabs.
In wormhole meshes, many routing algorithms prevent deadlocks by enclosing faulty nodes within faulty blocks. None of them however can tolerate the convex fault model without virtual channels. We propose a deterministic fault-tolerant wormhole routing algorithm for mesh networks that can handle disj...
The joint parameter identification and state estimation technique is applied to develop a fault-tolerant space robot system. The potential faults in the considered system are abrupt parametric faults, which indicate that some system parameters will immediately deviate from their nominal values if a fault happens. The concerned system parameters consist of deterministic parts as well as those describing the stochastic features in the system. Due to the purpose for design of reconfigurable control, these deviated system parameters need to be identified as precisely and quickly as possible. Meanwhile, it would further simplify the reconfigurable design task and possibly speed up the system recovery, if the system state information under the new operating circumstance can be available along with faulty parameter information. The joint parameter identification and state estimation using the combined Kalman Filter and Maximum Likelihood (KF-ML) techniques is discussed and applied in this study. The simulation results on a space robot system showed that the proposed method is quite promising in providing both faulty parameter information and state estimation in a quick, accurate and robust manner.
This work presents the management functions for a distributed computer control system, which has been developed at Centro de Pesquisas de Energia Eletrica - CEPEL with the aim of automating hydroelectric power plants and extra-high-voltage power substations. The initial specifications are based on an architecture without redundancies, whose initial objective is to gain practical knowledge of the system's behaviour. The first part gives an introduction to dependability, with two aims: to present the basic concepts and terminology, and to present a set of techniques in the area of computing systems fault tolerance. In the second part, the control system is introduced in terms of its architecture and functionality. The third part introduces the management activities for the distributed system together with a general model. Finally, fault diagnosis in the distributed system is discussed, and a new diagnosis algorithm is presented. (author)
An adaptive fault-tolerant wormhole routing algorithm based on a convex fault model in 2D meshes is presented. With the algorithm, a normal routing message, when blocked by faulty processors, would detour along some special polygons around the fault region. The result is that the proposed algorithm ...
A fault-tolerant wormhole routing algorithm using multi-phase minimal routing paths for mesh networks is proposed in this paper. When routing messages come in contact with a fault region, they always select a local shortest path around the fault-region in clockwise or counter clockwise direction. Th...
This paper presents the results of a study on fault-tolerant control of a ship propulsion benchmark (Izadi-Zamanabadi and Blanke, 1999), which uses estimated or virtual measurements as feedback variables. The estimator operates on a self-adjustable design model so that its outputs can be made immune to the effects of a specific set of component and sensor faults. The adequacy of sensor redundancy is measured using the control reconfigurability (Wu, Zhou, and Salomon, 2000), and the number of sensor-based measurements is increased when this level is found inadequate. As a result, sensor faults that are captured in the estimator's design model can be tolerated without the need for any reconfiguration actions. Simulations for the ship propulsion benchmark show that, with additional sensors added as described, satisfactory fault tolerance is achieved under two additive sensor faults, an incipient fault, and a parametric fault, without having to alter the original controller in the benchmark.
Diagnosis and, when possible, prognosis of faults are essential for safe and reliable operation. The area of fault diagnosis has emerged over three decades. The majority of studies relate to linear systems, but real-life systems are complex and nonlinear. The development of methodologies coping with complex and nonlinear systems has matured and, even though there are many unsolved problems, methodology and associated tools have become available in the form of theory and software for design. Genuine industrial cases have also become available. Analysis of system topology, referred to as structural analysis, has proven to be unique and simple in use, and a recent extension to active structural techniques has made fault isolation possible in a wide range of systems. Following residual generation using these topology-based methods, deterministic and statistical change detection has proven very useful for on-line prognosis and diagnosis. For complex systems, results from non-Gaussian detection theory have been employed with convincing results. The paper presents the theoretical foundation for design methodologies that now appear as enabling technology for a new area of design of systems that are reliable in practise. Yet they are also affordable due to the use of fault-tolerant philosophies and tools that make engineering efforts minimal for their implementation. The paper includes examples for an autonomous aircraft and a baling system for agriculture.
The goal of an intrusion tolerant network is to continue to provide predictable and reliable communication in the presence of a limited number of compromised network components. The behavior of a compromised network component ranges from a node that no longer responds to a node that is under the control of a malicious entity that is actively trying to cause other nodes to fail. Most current data communication networks do not include support for tolerating unconstrained misbehavior of components in the network. However, the fault tolerance community has developed protocols that provide both predictable and reliable communication in the presence of the worst possible behavior of a limited number of nodes in the system. One may view a malicious entity in a communication network as a node that has failed and is behaving in an arbitrary manner. NASA/Langley Research Center has developed one such fault-tolerant computing platform called SPIDER (Scalable Processor-Independent Design for Electromagnetic Resilience). The protocols and interconnection mechanisms of SPIDER may be adapted to large-scale, distributed communication networks such as would be required for future Air Traffic Management systems. The predictability and reliability guarantees provided by the SPIDER protocols have been formally verified. This analysis can be readily adapted to similar network structures.
A time delay control methodology is adopted to cope with degraded control performance due to control surface damage of unmanned aerial vehicles, especially in the case of the automatic landing phase. It is a crucial challenge to maintain consistent control performance even under fault environments such as stuck and/or incipient actuator faults. Flight control systems designed using conventional feedback control methods in such cases may result in unsatisfactory performance, and even worse, may not guarantee the closed-loop stability, which is fatal for aircraft in the state of auto-landing. To overcome the shortfalls of the conventional approach, the time delay control scheme is adopted. This scheme is known to be robust against disturbance, model uncertainties and so on. Motivated by the fact that the abrupt and/or incipient actuator faults focused on in this paper could be considered as model uncertainties, we consider the application of the time delay controller to designing a fault-tolerant control system. To show the effectiveness of the time delay control method, a nonlinear 6-DOF simulation is performed under model uncertainties and wind disturbances, and control performance is compared with that of conventional controllers in the case of multiple and single actuator faults.
Aerodynamic parameter estimation is an integral part of the aerospace system design and life cycle process. Recent advances in computational power have allowed the use of online parameter estimation techniques in varied applications such as reconfigurable or adaptive control, system health monitoring, and fault-tolerant control. The combined problem of state and parameter identification leads to a nonlinear filtering problem; furthermore, many aerospace systems are characterized by nonlinear models as well as noisy and biased sensor measurements. The extended Kalman filter (EKF) is a commonly used algorithm for recursive parameter identification due to its excellent filtering properties and is based on a first-order approximation of the system dynamics. Recently, the unscented Kalman filter (UKF) ...
The New Horizons spacecraft now bound for the planet Pluto and beyond to the Kuiper belt was successfully launched on January 19, 2006. The New Horizons spacecraft has backup devices for its major electronic modules, and also has a heavily cross-strapped architecture that makes the spacecraft fully tolerant of almost all single-point failures. The onboard fault protection system detects failure of a component and autonomously activates the backup system as necessary. The onboard safeing system detects anomalies in spacecraft performance, such as attitude control errors or propulsion system problems, and then autonomously places the spacecraft into one of two safe states.
In large-scale industrial plants, the process control system has multiple system servers to provide seamless services to plant operators irrespective of any system server's failure. In this paper, we present an autonomic connection scheme between the system server and the Human-Machine Interface application (HMI) without additional configuration. The proposed scheme is based on the concept of autonomic computing, supporting the fault-tolerant and/or load-balancing features. Finally, the mathematical analysis shows that the proposed scheme can provide high availability of services to users.
Current petascale systems have tens of thousands of hardware components and complex system software stacks, which increase the probability of faults occurring during the lifetime of a process. Checkpointing has been a popular method of providing fault tolerance in high-end systems. While considerable research has been done to optimize checkpointing, in practice the method still involves a high-cost overhead for users. In this paper, we study the checkpointing overhead seen by applications running on leadership-class machines such as the IBM Blue Gene/P at Argonne National Laboratory. We study various applications and design a methodology to assist users in understanding and choosing checkpointing frequency and reducing the overhead incurred. In particular, we study three popular applications -- the Grid-Based Projector-Augmented Wave application, the Carr-Parrinello Molecular Dynamics application, and a Nek5000 computational fluid dynamics application -- and analyze their memory usage and possible checkpointing trends on 32,768 processors of the Blue Gene/P system.
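As a rough illustration of what choosing a checkpointing frequency involves, the sketch below uses Young's classical first-order approximation, which is a textbook estimate and not the methodology of the abstract above. The function names and the example figures (a 50 s checkpoint write, a 10,000 s mean time between failures) are assumptions chosen for illustration only.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order estimate of the compute time between checkpoints.

    checkpoint_cost_s: time to write one checkpoint (assumed constant).
    mtbf_s: mean time between failures of the whole job (assumed known).
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order model of time lost per unit of useful work.

    The first term is the cost of writing checkpoints; the second is the
    expected rework after a failure (on average half an interval is lost).
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    cost, mtbf = 50.0, 10_000.0  # hypothetical values, in seconds
    best = young_interval(cost, mtbf)
    for interval in (best / 2, best, best * 2):
        print(f"interval={interval:7.1f}s overhead={overhead_fraction(interval, cost, mtbf):.3f}")
```

Under this simple model, checkpointing either more or less often than the estimated interval increases the modelled overhead; real applications would need measured checkpoint costs and failure rates.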
Motivated by the hypothesis that dilatancy plays a critical role in faulting in subduction zones, we are developing FDRA2 (Fault Dynamics with the Radiation-damping Approximation), a software package to simulate three-dimensional quasi-dynamic faulting that includes rate-state friction, thermal pressurization, and dilatancy (following Segall and Rice [1995]) in a finite-width shear zone. This work builds on the two-dimensional simulations performed by FDRA1 (Bradley and Segall [AGU 2010], Segall and Bradley [submitted]). These simulations show that at lower background effective normal stress ($\bar{\sigma}$), slow slip events occur spontaneously, whereas at higher $\bar{\sigma}$, slip is inertially limited. At intermediate $\bar{\sigma}$, dynamic events are followed by quiescent periods and then long durations of repeating slow slip events. Models with depth-dependent properties produce sequences similar to those observed in Cascadia. Like FDRA1, FDRA2 solves partial differential equations in pressure and temperature on profiles normal to the fault. The diffusion equations are discretized in space using finite differences on a nonuniform mesh having greater density near the fault. The full system of equations is a semiexplicit index-1 differential algebraic equation (DAE) in slip, slip rate, state, fault zone porosity, pressure, and temperature. We integrate state, porosity, and slip explicitly; solve the momentum balance equation on the fault for slip rate; and integrate pressure and temperature implicitly. Adaptive time steps are limited by accuracy and the stability criterion governing explicit integration of hyperbolic, but not the more stringent one governing parabolic, PDE. To compute elasticity in a 3D half-space, FDRA2 compresses the large, dense matrix arising from the boundary element method using an H-matrix. The work to perform a matrix-vector product scales almost linearly, rather than quadratically, in the number of fault cells.
A new technique to relate the error tolerance on the approximation to parameters of the compression algorithms (Bradley [submitted]) improves the compression efficiency for the third-order $1/r^3$ singularity with elastic Green's functions, relative to the standard method, by factors of two to five, while, importantly, still providing the same straightforward error bound on the approximation. The compression (and so, roughly, the speedup) factor for a problem in which the fault is discretized by 156 × 402 rectangles and the tolerance on the relative error is $10^{-5}$ is just over 75. We will describe our numerical methods and present preliminary simulation results.
Auction mechanisms are nowadays widely used in electronic commerce Web sites for buying and selling items among different users. The increasing importance of auction protocols in the negotiation phase is not limited to online marketplaces. In fact, the wide applicability of auctions as resource-allocation and negotiation mechanisms has also led to a great deal of interest in auctions within the agent community. A challenging issue for agents operating in open Multiagent Systems (such as the emerging semantic Web infrastructure) concerns the specification of declarative communication rules which could be published and shared, allowing agents to dynamically engage in well-known and trusted negotiation protocols. To cope with real-world applications, these rules should also specify fault-tolerant patterns of interaction, enabling negotiating agents to interact with each other while tolerating failures, for instance terminating an auction process even if some bidding agents dynamically crash. In this paper, we propose an approach to specify fault-tolerant auction protocols in open and dynamic environments by means of communication rules dealing with crash failures of agents. We illustrate these concepts with a case study on the specification of an English Auction protocol that tolerates crashes of bidding agents, and we discuss its properties.
We present a tight integration of a user-level thread scheduler and a zero-copy messaging system that has been designed and optimized for scalable and efficient fine-grain parallel processing, on commodity platforms, with support for fault-tolerance. The system delivers most of the performance of the underlying communication hardware to a multi-threaded application level, while introducing little CPU overhead. This is demonstrated by a performance analysis of an implementation using off-the-shelf commodity products: PCs, running the Linux operating system, equipped with fast and gigabit Ethernet network interface cards. (21 refs).
An expanded nonlinear model inversion flight control strategy using sliding mode online learning for neural networks is presented. The proposed control strategy is implemented for a small unmanned aircraft system (UAS). This class of aircraft is very susceptible to nonlinearities such as atmospheric turbulence, model uncertainties and, of course, system failures. Therefore, these systems make a sensible testbed to evaluate fault-tolerant, adaptive flight control strategies. Within this work the concept of feedback linearization is combined with feed-forward neural networks to compensate for inversion errors and other nonlinear effects. Backpropagation-based adaption laws of the network weights are used for online training. Within these adaption laws the standard gradient descent backpropag...
We describe a structured system for distributed mechanism design. It consists of a sequence of layers. The lower layers deal with the operations relevant for distributed computing only, while the upper layers are concerned only with communication among players, including broadcasting and multicasting, and distributed decision making. This yields a highly flexible distributed system whose specific applications are realized as instances of its top layer. This design supports fault-tolerance, prevents manipulations and makes it possible to implement distributed policing. The system is implemented in Java. We illustrate it by discussing a number of implemented examples.
This paper presents H, a minimalistic specification language for designing heterogeneous software applications, particularly in the realms of robotics and industria, which takes advantage of a Component-Based Software Engineering (CBSE) approach. H copes with some of the most outstanding characteristics of these systems, like diversity at different levels (hardware platforms, programming languages, programmer skills), network distribution, real time and fault-tolerance. The H specification covers the life-cycle of any heterogeneous application. Its development system offers to the designer and/or builder a set of tools for specifying modules, generating code semiautomatically, debugging, maintenance, and a real time analysis of the system.
Current automation solutions have a limited capability for agile adaptation to unexpected internal and external disturbances. Distributed intelligent control systems based on agent technologies are seen as a promising approach to handle the dynamics in large complex systems, reducing complexity, increasing flexibility and enhancing fault tolerance. We developed a new architecture of automation agents capable of meeting the major requirements of the process domain. The application of these technologies enables efficient process scheduling, monitoring and diagnosis. Moreover, when production conditions change, the presented system architecture enables a flexible automatic reconfiguration of the control software.
The cryostat for the 'Herschel Space Observatory' for the European Space Agency (ESA) science program, planned for a launch with Ariane 5 in 2007, is designed for 6 days ground hold time and 3.5 years lifetime in orbit. The system comprises two tanks containing about 346 kg of liquid and superfluid helium, with two cryogenic cold safety valves and burst disks, surrounded by three vapor-cooled shields and a vacuum vessel. The safety system is two-fault tolerant, with three independent paths for pressure relief. The analyses of failure modes and resulting mass flows and the safety elements of the cryogenic system will be discussed.
A quorum system is a collection of subsets of nodes, called quorums, with the property that each pair of quorums have a non-empty intersection. Quorum systems are the key mathematical abstraction for ensuring consistency in fault-tolerant and highly available distributed computing. Critical for many applications since the early days of distributed computing, quorum systems have evolved from simple majorities of a set of processes to complex hierarchical collections of sets, tailored for general adversarial structures. The initial non-empty intersection property has been refined many times to a
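The defining intersection property can be checked directly for small node sets. The following is a minimal sketch, assuming the simplest construction mentioned in the abstract (simple majorities); the helper names `majority_quorums` and `is_quorum_system` are hypothetical and not taken from the source.

```python
from itertools import combinations


def majority_quorums(nodes):
    """Return every subset of `nodes` that contains a strict majority."""
    nodes = sorted(nodes)
    size = len(nodes) // 2 + 1  # smallest strict majority
    return [frozenset(c) for c in combinations(nodes, size)]


def is_quorum_system(quorums):
    """True if every pair of quorums has a non-empty intersection."""
    return all(q1 & q2 for q1, q2 in combinations(quorums, 2))


if __name__ == "__main__":
    quorums = majority_quorums(range(5))
    print(len(quorums), is_quorum_system(quorums))
    # Two disjoint sets violate the intersection property.
    print(is_quorum_system([frozenset({0, 1}), frozenset({2, 3})]))
```

Any two majorities of the same node set must share at least one node, which is why a value written to one majority is visible to any later majority read.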
The optimum redundancy for an avionics processor can be determined from cost and reliability considerations. The use and expense of redundant architectures are examined, along with the cost and advantages of using space-qualified parts. The advanced launch system (ALS) vehicle model was used for the comparisons. Avionics redundancy models included duplex, triple modular redundancy, and quad systems. Processors were modeled as simplex, dual self-checking pairs, or triplex checking. Cost factors were those which result in the cost per launched vehicle. These included cost of launch equipment, cost of scrubbing a launch, failure investigation, repair, and the cost of money due to schedule delays. The primary conclusion reached was that the use of redundancy to achieve fault tolerance is required for higher value missions. The use of less-highly qualified parts can lower costs for less expensive payloads, but will require a culture change to allow launching with known faults. The need for greater emphasis on determination of coverage for fault-tolerant systems was demonstrated.
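The reliability side of such a comparison can be illustrated with the standard k-out-of-n formula. This is a minimal sketch, not the cost model of the abstract above: it assumes independent, identical units, a perfect voter, and perfect fault coverage, and the unit reliability of 0.9 is a made-up value. The abstract's closing remark on coverage is exactly the assumption this sketch idealizes away.

```python
from math import comb


def k_of_n_reliability(r: float, k: int, n: int) -> float:
    """Probability that at least k of n independent units (each with
    reliability r) are working. Assumes perfect voting and coverage."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be a probability")
    return sum(comb(n, i) * r**i * (1.0 - r) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    r = 0.9  # hypothetical single-unit reliability
    print("simplex (1-of-1):", k_of_n_reliability(r, 1, 1))
    print("duplex  (1-of-2):", k_of_n_reliability(r, 1, 2))
    print("TMR     (2-of-3):", k_of_n_reliability(r, 2, 3))
```

For triple modular redundancy the sum reduces to the familiar closed form 3r^2 - 2r^3, which exceeds r only when r is above 0.5.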
Even though conventional probabilistic safety assessment methods are immature for application to microprocessor-based digital systems, practical needs force their use. In Korea, UCN 5 and 6 units are being constructed and the Korean Next Generation Reactor is being designed using digital instrumentation and control equipment for safety-related functions. The Korean regulatory body requires probabilistic safety assessment. This paper analyzes the difficulties in the assessment of digital systems and suggests an intermediate framework for evaluating their safety using fault tree models. The framework deals with several important characteristics of digital systems, including software modules and fault-tolerant features. We expect that the analysis result will provide valuable design feedback. (authors)
Due to continuous technology scaling VLSI circuits feature an increasing susceptibility to transient faults. While complete elimination of errors cannot be guaranteed, current mitigation techniques based on circuit improvement or architectural measures cause a large overhead in terms of area and energy consumption. A more efficient possibility to cope with transient faults can be to tolerate hardware errors at low physical levels and handle them at higher system levels. This can be achieved by reusing error handling capabilities - such as channel decoders - or introducing specialized error correction blocks that take advantage of the system characteristics by concentrating the effort on the components and bits most crucial for system operation. To enable this approach the influence of hard...
This work deals with fault tolerance in distributed MANET (mobile ad hoc network) systems. The major difficulty for a failure detection protocol is that it may confuse a fault with a voluntary or involuntary disconnection of a node, and therefore suspect correct nodes of having failed, and conversely. Within this context, we propose a failure detection protocol that copes with the constraints of MANET systems. The aim of this work is to allow the system to launch a recovery process. To this end, our protocol, called FDAN, is based on the class of heartbeat protocols. It takes into account the absence of preliminary knowledge of the network, node disconnection and reconnection, resource limitations, and so on. We show that by using temporary lists and different timeout levels, we appreciably reduce the number of false suspicions.
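The idea of graded timeouts in a heartbeat detector can be sketched as follows. This is a generic illustration and not the FDAN protocol, whose details are not given in the abstract: the class name, the two thresholds, and the three status labels are all assumptions. A node that stays silent is first only suspected, which leaves room for a temporarily disconnected node to return before it is declared failed.

```python
class HeartbeatDetector:
    """Toy heartbeat failure detector with two timeout levels (hypothetical)."""

    def __init__(self, suspect_after: float, fail_after: float):
        if not 0 < suspect_after < fail_after:
            raise ValueError("need 0 < suspect_after < fail_after")
        self.suspect_after = suspect_after
        self.fail_after = fail_after
        self.last_seen = {}  # node id -> time of the last heartbeat

    def heartbeat(self, node, now: float) -> None:
        """Record a heartbeat; nodes are learned as they appear, so no
        prior knowledge of the network is needed."""
        self.last_seen[node] = now

    def status(self, node, now: float) -> str:
        """Classify a node by how long it has been silent."""
        if node not in self.last_seen:
            return "unknown"
        silence = now - self.last_seen[node]
        if silence >= self.fail_after:
            return "failed"
        if silence >= self.suspect_after:
            return "suspected"  # may only be disconnected; keep watching
        return "alive"


if __name__ == "__main__":
    d = HeartbeatDetector(suspect_after=2.0, fail_after=5.0)
    d.heartbeat("a", now=0.0)
    print(d.status("a", now=1.0), d.status("a", now=3.0), d.status("a", now=6.0))
    d.heartbeat("a", now=6.5)  # the node reconnects
    print(d.status("a", now=7.0))
```

Time is passed in explicitly rather than read from a clock so that the behaviour is deterministic and easy to test; a real deployment would also have to choose the thresholds from observed message delays.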
Advanced materials and structures technologies are needed for future ES platforms. ... Large deployable and/or inflatable/rigidizable aperture systems ... Space rigidizable polymers ..... battery charge controllers, fault detection, fault isolation, autonomous fault recovery, active impedance control, active noise cancellation, ...
In fault-tolerant quantum computing schemes, the overhead is often dominated by the cost of preparing codewords reliably. This cost generally increases quadratically with the block size of the underlying quantum error-correcting code. In consequence, large codes that are otherwise very efficient have found limited fault-tolerance applications. Fault-tolerant preparation circuits therefore are an important target for optimization. We study the Golay code, a 23-qubit quantum error-correcting code that protects the logical qubit to a distance of seven. In simulations, even using a naive ancilla preparation procedure, the Golay code is competitive with other codes both in terms of overhead and the tolerable noise threshold. We provide two simplified circuits for fault-tolerant preparation of Golay code-encoded ancillas. The new circuits minimize error propagation, reducing the overhead by roughly a factor of four compared to standard encoding circuits. By adapting the malignant set counting technique to depolariz...
Many practical scientific applications would benefit from a simple checkpointing mechanism to provide automatic restart or recovery in response to faults and failures. CUMULVS is a middleware infrastructure for interacting with parallel scientific simulations to support online visualization and computational steering. The base CUMULVS system has been extended to provide a user-level mechanism for collecting checkpoints in a parallel simulation program. Via the same interface that CUMULVS uses to identify and describe data fields for visualization and parameters for steering, the user application can select the minimal program state necessary to restart or migrate an application task. The CUMULVS run-time system uses this information to efficiently recover fault-tolerant applications by restarting failed tasks. Application tasks can also be migrated -- even across heterogeneous architecture boundaries -- to achieve load balancing or to improve the task's locality with a required resource. This paper describes the CUMULVS interface for checkpointing, the issues faced in utilizing this interface when developing fault-tolerant and migrating applications, and the direction of future research in this area.
A complex pattern of active faults occurs in the southern Amargosa Desert, southern Nye County, Nevada. These faults can be grouped into three main fault systems: (1) a NE-striking zone of faults that forms the southwest extension of the left-lateral Rock Valley fault zone, in the much larger Spotted Range-Mine Mountain structural zone, (2) a N-striking fault zone coinciding with a NNW-trending alignment of springs that is either a northward continuation of a fault along the west side of the Resting Spring Range or a N-striking branch fault of the Pahrump fault system, and (3) a NW-striking fault zone which is parallel to the Pahrump fault system, but is offset approximately 5 km with a left step in southern Ash Meadows. These three fault zones suggest extension is occurring in an E-W direction, which is compatible with the approximately N10W structural grain prevalent in the Death Valley extensional region to the west.
The application of concatenated codes to fault-tolerant quantum computing is discussed. We have previously shown that for quantum memories and quantum communication, a state can be transmitted with error $\epsilon$ provided each gate has error at most $c\epsilon$. We show how this can be used with Shor's fault-tolerant operations to reduce the accuracy requirements when maintaining states not currently participating in the computation. Viewing Shor's fault-tolerant operations as a method for reducing the error of operations, we give a concatenated implementation which promises to propagate the reduction hierarchically. This has the potential of reducing the accuracy requirements in long computations.
A long-standing open problem in fault-tolerant quantum computation has been to find a universal set of transversal gates. As three of us proved in arXiv:0706.1382, such a set does not exist for binary stabilizer codes. Here we generalize our work to show that for subsystem stabilizer codes in $d$ dimensional Hilbert space, such a universal set of transversal gates cannot exist for even one encoded qudit, for any dimension $d$, prime or nonprime. This result strongly supports the idea that other primitives, such as quantum teleportation, are necessary for universal fault-tolerant quantum computation, and may be an important factor for fault-tolerance noise thresholds.
Interest in situation-aware ubiquitous computing has increased lately. An example of a situation-aware application is a multimedia education system. Since ubiquitous applications need situation-aware middleware services, and the computing environment keeps changing as the applications change, it is challenging to detect errors and recover from them in order to provide seamless services and avoid a single point of failure. This paper proposes an Adaptive Fault Tolerance Agent (AFTA) in a situation-aware middleware framework and presents a simulation model of AFT-based agents. The strength of this system is that it detects and recovers from errors automatically when a session's process terminates because of a software error.
The major goals of this effort are as follows: (1) to examine technology insertion options to optimize Advanced Information Processing System (AIPS) performance in the Advanced Launch System (ALS) environment; (2) to examine the AIPS concepts to ensure that valuable new technologies are not excluded from the AIPS/ALS implementations; (3) to examine advanced microprocessors applicable to AIPS/ALS; (4) to examine radiation hardening technologies applicable to AIPS/ALS; (5) to reach conclusions on AIPS hardware building blocks implementation technologies; and (6) to reach conclusions on appropriate architectural improvements. The hardware building blocks are the Fault-Tolerant Processor, the Input/Output Sequencers (IOS), and the Intercomputer Interface Sequencers (ICIS).
This report presents a propulsion system model for a low-speed marine vehicle, which can be used as a test benchmark for Fault-Tolerant Control purposes. The benchmark serves the purpose of offering realistic and challenging problems relevant to both the FDI and the (autonomous) supervisory control areas. The propulsion system model is presented in two versions: the first consists of one engine and one propeller, and the other consists of two engines and their corresponding propellers placed in parallel in the ship. The corresponding programs have been developed and are available.
We propose an approach for single spin measurement. Our method uses techniques from the theory of quantum cellular automata to correlate a large amount of ancillary spins to the one to be measured. It has the distinct advantage of being efficient, and to a certain extent fault-tolerant. Under ideal conditions, it requires the application of only order of cube root of N steps (each requiring a constant number of rf pulses) to create a system of N correlated spins. It is also fairly robust against pulse errors, imperfect initial polarization of the ancilla spin system, and does not rely on entanglement. We study the scalability of our scheme through numerical simulation.
Scientific workflow systems often operate in unreliable environments, and have accordingly incorporated different fault-tolerance techniques. One of them is the checkpointing technique combined with its corresponding rollback recovery process. Different checkpointing schemes have been developed and at various levels: task- (or activity-) level and workflow-level. At workflow-level, the usually adopted approach is to establish a checkpointing frequency in the system which determines the moment at which a global workflow checkpoint - a snapshot of the whole workflow enactment state at normal execution (without failures) - has to be accomplished. We describe an alternative workflow-level checkpointing scheme and its corresponding rollback recovery process for hierarchical scientific workflows...
We discuss how to implement quantum computation on a system with an intrinsic Hamiltonian by controlling a limited subset of spins. Our primary result is an efficient control sequence on a chain of hopping, non-interacting, fermions through control of a single site and its interaction with its neighbor. This is applicable to a wide class of spin chains through the Jordan-Wigner transformation. We also discuss how an array of sites can be controlled to give sufficient parallelism for the implementation of fault-tolerant circuits. The framework provides a vehicle to expose the contradictions between the control theoretic concept of controllability with the ability of a system to perform quantum computation.
We present a solid-state laser system that generates 750 mW of continuous-wave single-frequency output at 313 nm. Sum-frequency generation with fiber lasers at 1550 nm and 1051 nm produces up to 2 W at 626 nm. This visible light is then converted to UV by cavity-enhanced second-harmonic generation. The laser output can be tuned over a 495 GHz range, which includes the 9Be+ laser cooling and repumping transitions. This is the first report of a narrow-linewidth laser system with sufficient power to perform fault-tolerant quantum-gate operations with trapped 9Be+ ions by use of stimulated Raman transitions.
Recent experience indicates that the LCLS undulator segments must not, at any time following tuning, be allowed to change temperature by more than about ±2.5 °C or the magnetic center will irreversibly shift outside of acceptable tolerances. This vulnerability raises a concern that under fault conditions the ambient temperature in the Undulator Hall might go outside of the safe range and potentially could require removal and retuning of all the segments. In this note we estimate changes that can be expected in the Undulator Hall air temperature for three fault scenarios: (1) System-wide power failure; (2) Heating Ventilation and Air Conditioning (HVAC) system shutdown; and (3) HVAC system temperature regulation fault. We find that for either a system-wide power failure or an HVAC system shutdown (with the technical equipment left on), the short-term temperature changes of the air would be modest due to the ability of the walls and floor to act as a heat ballast. No action would be needed to protect the undulator system in the event of a system-wide power failure. Some action to adjust the heat balance, in the case of the HVAC power failure with the equipment left on, might be desirable but is not required. On the other hand, a temperature regulation failure of the HVAC system can quickly cause large excursions in air temperature and prompt action would be required to avoid damage to the undulator system.
This paper deals with the reliability of motor drive systems. It focuses on the switching power converter, which is the weakest part of the drive. It investigates a new architecture which provides intrinsic redundancy. The method used for switch fault detection and diagnosis is based on a sliding mode observer. The original aspect of this new approach is that the observer influences the control algorithm in order to rule out or accept the possibility of a device failure. This leads to the right decision, which triggers the rebuilding of the converter topology. This fault-tolerant control strategy assures continuous operation even when one switch failure occurs. This article has been submitted as part of “IET Colloquium on Reliability in Electromagnetic Systems”, 24 and 25 May 2007, Paris.
This paper gives an overview of the SMZO-PACS-Project in the form of a rough specification of the system architecture and the functional parameters related to it. The PACS architecture, determined by the large data volume produced in the SMZO Hospital, is outlined. In both the radiology and the trauma department, high technical requirements concerning data throughput and fault tolerance are demanded. Therefore these PACS modules are designed to minimize the workload of the network so that performance is not degraded in the case of a fault in a single component. A PACS module includes image acquisition devices of a certain modality with related reporting workstations and a distributed electronic archive. The functionality of the modules is described, with special attention to the integration of the different information management systems PACS, RIS and HIS, to achieve a complete record of data input and throughput in the hospital. (author). 11 refs., 3 figs., 1 tab.
In order to identify and segment active faults, the literature on structural geology, paleoseismology, and geophysical exploration was investigated. The existing structural geological criteria for segmenting active faults were examined. These are mostly based on normal fault systems; thus, additional criteria are needed for application to different types of fault systems. The definition of the seismogenic fault, characteristics of fault activity, criteria and study results of fault segmentation, the relationship between segmented fault length and maximum displacement, and the estimation of seismic risk of segmented faults were examined in the paleoseismic study. Earthquake history, such as the dynamic pattern of faults, return period, and magnitude of the maximum earthquake originated by fault activity, can be revealed by such study. It is confirmed through various case studies that numerous geophysical explorations, including electrical resistivity, land seismic, marine seismic, ground-penetrating radar, magnetic, and gravity surveys, have been efficiently applied to the recognition and segmentation of active faults.
This paper presents methods of fault type classification and fault section estimation for high-speed relaying in transmission systems using neural networks. A conventional distance relay determines fault distance by calculating the impedance between the relay and the fault point. In this paper, by contrast, fault types are classified by a neural network whose inputs are the instantaneous values of voltages and currents, and a fault section estimator then uses these results to estimate the fault section directly. We simulate KEPCO's equivalent transmission system (Gori-Sinyangsan) using ATP (Alternative Transients Program) to show the feasibility of the neural network methods; the proposed methods can determine the fault section rapidly. (author). 14 refs., 7 figs., 5 tabs.
Detailed marine geophysical surveys of the inner California continental borderland west of northern Baja California show that the region is underlain by two major, northwest-trending, Quaternary, dextral wrench fault systems. The San Clemente fault system lies along the western part of the inner borderland and is delineated by the San Clemente and San Isidro fault zones. Together, these fault zones connect to form a long (300 km), narrow (5-10 km), continuous zone of faulting that is very similar to the larger San Andreas fault system onshore. The Agua Blanca fault system is a complex zone of shear delineated by three or more subparallel wrench fault zones in the eastern part of the inner borderland. The westernmost San Diego Trough-Bahia Soledad fault zone consists of relatively long (50 km), continuous, main fault traces which cut the Quaternary sediments of the nearshore basin trough. The Coronado Bank-Agua Blanca fault zone is more complicated, with numerous discontinuous, subparallel, right- and left-stepping, anastomosing fault traces which are associated with significant structural relief. A nearshore zone of faults, marked by the Newport-Inglewood-Rose Canyon fault zone in the north and the Estero-Descanso fault zone in the south, parallels the coast and defines the eastern boundary of the California continental borderland structural province. All of these eastern fault zones merge into the transpeninsular Agua Blanca fault, and their N30°W trend differs substantially from the trend of the major peninsular ranges fault zones.
Workstation-based distributed computing environments are becoming popular in both academic and commercial communities due to the continuing trend of decreasing cost/performance ratio and the rapid development of networking technology. However, the workload on these workstations is usually much lower than their computing capacity, especially with the ever-increasing computing power of new hardware. As a result, the resources of such workstations are often under-utilized and many of them are frequently idle. A preemptive process migration facility can be provided in such a distributed system to dynamically relocate running processes among the component machines. Such relocation can help cope with dynamic fluctuations in loads and service needs, improve the system's fault tolerance, meet real-time scheduling deadlines, or bring a process to a special device. This paper presents a process migration subsystem for tolerating process and node failures in a workstation-based environment. The design and implementation of the subsystem are also discussed.
Due to the beamline space constraints and the modular design of the vacuum system, a conventional bellows cannot be used everywhere in the PEP-II High-Energy Ring (HER) arcs. A zero-length "Flex Flange" was developed which actually performs better than a more standard bellows. The Flex Flange fits the space available while still preserving the modularity of the system. Furthermore, the design provides for an accurate match-up between adjoining octagonal copper chambers despite the large fabrication and assembly tolerances and high operational loads. Beam chamber continuity is ensured by an integral RF seal ring which is easy to install and fault-tolerant. Heating from synchrotron radiation and higher-order mode trapping is managed to ensure a robust connection despite the 3,000 mA beam current of the PEP-II HER. The Flex Flange concept is versatile and adaptable to many applications, yet economical both in space needed and cost.
This topic is devoted to communication issues in scalable compute and storage systems, such as parallel computers, networks of workstations, and clusters. All aspects of communication in modern systems were solicited, including advances in the design, implementation, and evaluation of interconnection networks, network interfaces, system and storage area networks, on-chip interconnects, communication protocols, routing and communication algorithms, and communication aspects of parallel and distributed algorithms. In total 15 papers were submitted to this topic of which we selected the 7 strongest papers. We grouped the papers in two sessions of 3 papers each and one paper was selected for the best paper session. We noted a number of papers dealing with changing topologies, stability and forwarding convergence in source routing based cluster interconnect network architectures. We grouped these for the first session. The authors of the paper titled: “Implementing a Change Assimilation Mechanism for Source Routing Interconnects” propose a mechanism that can obtain the new topology, and compute and distribute a new set of fabric paths to the source routed network end points to minimize the impact on the forwarding service. The article entitled “Dependability Analysis of a Fault-tolerant Network Reconfiguration Strateg” reports on a case study analyzing the effects of network size, mean time to node failure, mean time to node repair, mean time to network repair and coverage of the failure when using a 2D mesh network with a fault-tolerant mechanism (similar to the one used in the BlueGene/L system), that is able to remove rows and/or columns in the presence of failures. The last paper in this session: “RecTOR: A New and Efficient Method for Dynamic Network Reconfiguration” presents a new dynamic reconfiguration method, that ensures deadlock-freedom during the reconfiguration without causing performance degradation such as increased latency or decreased throughput. 
The second session groups 3 papers presenting methods, protocols and architectures that enhance capacities in networks. The paper titled: “NIC-assisted Cache-Efficient Receive Stack for Message Passing over Ethernet” presents the addition of multiqueue support in the Open-MX receive stack so that all incoming packets for the same process are treated on the same core. It then introduces the idea of binding the target end process near its dedicated receive queue. In general this multiqueue receive stack performs better than the original single queue stack, especially on large communication patterns where multiple processes are involved and manual binding is difficult. The authors of: “A Multipath Fault-Tolerant Routing Method for High-Speed Interconnection Networks” focus on the problem of fault tolerance for high-speed interconnection networks by designing a fault-tolerant routing method. The goal was to tolerate a certain number of link and node failures, considering their impact and occurrence probability. Their experiments show that their method allows applications to successfully finalize their execution in the presence of several faults, with an average performance value of 97% with respect to the fault-free scenarios. The paper: “Hardware implementation study of the Self-Clocked Fair Queuing Credit Aware (SCFQ-CA) and Deficit Round Robin Credit Aware (DRR-CA) scheduling algorithms” proposes specific implementations of the two schedulers taking into account the characteristics of current high-performance networks. A comparison is presented on the complexity of these two algorithms in terms of silicon area and computation delay. Finally we selected one paper for the special paper session: “A Case Study of Communication Optimizations on 3D Mesh Interconnects”. In this paper the authors present topology aware mapping as a technique to optimize communication on 3-dimensional mesh interconnects and hence improve performance. 
Results are presented for OpenAtom on up to 16,384 processors of Blue Gene/L, 8,192 processors of Blue Gene/P and 2,048 processors of Cray XT3.
This paper presents an interconnect resilient (IR) methodology with maximal interconnect fault tolerance, yield, and reliability for both single and multiple interconnect faults under stuck-at and open fault models. By exploiting multiple routes inherent in an interconnect structure, this method can tolerate faulty connections by efficiently finding alternative paths. The proposed approach is compatible with previous interconnect detection and diagnosis methods under oscillation ring schemes, and together they can be applied to implement a robust interconnect structure that may still provide correct communication even under multiple link faults in Networks-on-Chip (NoCs). With such knowledge, designers can significantly improve interconnect reliability by augmenting vulnerable interconnect structures in NoCs. The experimental results show that alternative paths in NoCs can be found for almost all paths. Hence, the proposed method provides a good way to achieve fault tolerance and reliability/yield improvement.
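The idea of routing around faulty links in the abstract above can be illustrated with a small sketch. It is not the paper's IR algorithm: it assumes a 2D mesh NoC, models a faulty link as an unordered pair of neighbouring routers, and uses a plain breadth-first search to look for an alternative path. The name `find_path` and the mesh model are hypothetical choices made for this illustration.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

Node = Tuple[int, int]  # (row, column) of a router in the mesh


def find_path(rows: int, cols: int, faulty: Set[FrozenSet[Node]],
              src: Node, dst: Node) -> Optional[List[Node]]:
    """Breadth-first search for a shortest path that avoids faulty links.

    Assumption: `faulty` holds each broken link as frozenset({a, b}).
    Returns the list of routers from src to dst, or None if dst is cut off.
    """
    parent: Dict[Node, Optional[Node]] = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the parent pointers back to src to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        r, c = node
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            in_mesh = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            usable = frozenset((node, nxt)) not in faulty
            if in_mesh and usable and nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


# One broken link on the direct route forces a detour through the next row.
broken = {frozenset({(0, 0), (0, 1)})}
detour = find_path(3, 3, broken, (0, 0), (0, 2))
```

In this sketch a router whose links have all failed simply yields `None`; a real NoC would also have to handle deadlock freedom and routing-table updates, which are outside the scope of the example.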
In quantum computation the importance of fault tolerance is paramount, due to the low reliability of the quantum circuit components. Therefore, several fault tolerance assessment tools and methodologies have been developed; most of them are analytic, dependent on the adopted fault model, and based on some simplifying assumptions. Simulation could have been a more realistic and accurate alternative had it not been confronted with the high complexity of simulating quantum circuits. However, a hardware description language (HDL) implementation for simulated fault injection (SFI) was proposed and tested for limited-size quantum circuits. This paper proposes a new, hybrid simulation-analytic, SFI-based methodology for quantum circuit fault tolerance assessment that is scalable to arbitrary size ci...
Motivated by the well-known problem of establishing efficient diagnostic techniques for detecting faults in fault-tolerant computer systems, we study the problem of computing majority with restricted tests in a set of items of two types (e.g., faulty and non-faulty). Stated in a more abstract form, consider a bin containing n balls colored with two colors. In a k-query, k balls are selected by a questioner and an oracle gives an answer which (depending on the computation model being considered) is related to the distribution of colors of the balls in this k-tuple. The oracle never reveals the colors of the individual balls. Following a number of queries and answers the questioner is said to determine majority if either (1) it can output a ball of the majority color provided that such a col...
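For intuition, the simplest instance of this setting is k = 2 with an oracle that only says whether two balls have the same color. The sketch below is not taken from the paper: it uses the classic Boyer-Moore voting scheme followed by a verification pass, and makes no claim about the query-optimal strategies the paper studies. The names `make_pair_oracle` and `majority_ball` are hypothetical.

```python
from typing import Callable, Optional, Sequence


def make_pair_oracle(colors: Sequence[int]) -> Callable[[int, int], bool]:
    """Hypothetical 2-query oracle: it reveals only whether balls i and j
    share a color, never the colors themselves."""
    def same(i: int, j: int) -> bool:
        return colors[i] == colors[j]
    return same


def majority_ball(n: int, same: Callable[[int, int], bool]) -> Optional[int]:
    """Return the index of a ball of the strict-majority color, or None
    if neither color holds a strict majority."""
    if n == 0:
        return None
    candidate, count = 0, 1
    for i in range(1, n):
        if count == 0:
            candidate, count = i, 1      # start over with a new candidate
        elif same(candidate, i):
            count += 1                   # another ball of the candidate's color
        else:
            count -= 1                   # cancel one candidate ball against ball i
    # Verification pass: count how many balls match the surviving candidate.
    matches = sum(1 for i in range(n) if i == candidate or same(candidate, i))
    return candidate if 2 * matches > n else None
```

The first pass uses at most n - 1 queries and the verification pass at most n - 1 more, so the sketch stays linear in n.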
The Modular Modeling System (B&W MMS) is being used as a design tool to verify robust controller designs for improving power plant performance while also providing fault-accommodating capabilities. These controllers are designed based on optimal control theory and are thus model-based controllers which are targeted for implementation in a computer-based digital control environment. The MMS is being successfully used to verify that the controllers are tolerant of uncertainties between the plant model employed in the controller and the actual plant; i.e., that they are robust. The two areas in which the MMS is being used for this purpose are the design of (1) a reactor power controller with improved reactor temperature response, and (2) a multiple input multiple output (MIMO) robust fault-accommodating controller for a deaerator level and pressure control problem.
As High-End Computing machines continue to grow in size, issues such as fault tolerance and reliability limit application scalability. Current techniques to ensure progress across faults, like checkpoint-restart, are unsuitable at these scales due to excessive overheads predicted to more than double an application's time to solution. Redundant computation, long used in distributed and mission-critical systems, has been suggested as an alternative to checkpoint-restart on its own. In this paper we describe the rMPI library, which enables portable and transparent redundant computation for MPI applications. We detail the design of the library as well as two replica consistency protocols, outline the overheads of this library at scale on a number of real-world applications, and finally outline the significant increase in an application's time to solution at extreme scale as well as show the scenarios in which redundant computation makes sense.
In control design, fault identification and fault-tolerant control, the controlled process is usually perceived as a dynamical process, captured in a mathematical model. The design of a control system for a complex process, however, typically begins long before these mathematical models become relevant and available. To consider the role of control functions in process design, a good qualitative understanding of the process as well as of control functions is required. As the purpose of a control function is closely tied to the process functions, its failure has a direct effect on the process behaviour and its function. This paper presents a formal methodology for the qualitative representation of control functions in relation to their process context. Different types of relevant process and control abstractions are introduced and their application to formal analysis of control failure modes from a process perspective is presented. Finally, anticipated applications in the context of offline analysis and online supervisory control are discussed.
The objective of the methods within the framework of plug-and-play process control, and particularly fault-tolerant control, is to establish control techniques which guarantee a certain performance through control reconfiguration when faults or changes occur. These methods cannot be effective if sufficient redundancy does not exist in the process. A measure of control reconfigurability which reveals the level of redundancy in connection with feedback control is proposed in this paper for bilinear systems. The proposed control reconfigurability measure is an extension of its gramian-based counterpart, which has been previously proposed for linear processes. The control reconfigurability is calculated for the bilinear models of an electro-hydraulic drive to show its relevance to redundant actuating capabilities in the models.
A properly designed monitoring and diagnostic system must be capable of detecting and distinguishing sensor and process malfunctions in the presence of signal noise, varying process states and multiple faults. The technique presented in this paper addresses these objectives through the implementation of a multivariate state estimation algorithm based upon pattern recognition methodology coupled with a statistically-based hypothesis test. Utilizing a residual signal vector generated from the difference between the estimated and measured current states of a process, disturbances are detected and identified with statistical hypothesis testing. Since the hypothesis testing utilizes the inherent noise on the signals to obtain a conclusion and the state estimation algorithm requires only a majority of the sensors to be functioning to ascertain the current state, this technique has proven to be quite robust and fault-tolerant. Several examples of its application are presented.
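The residual-plus-hypothesis-test step described above can be sketched as follows. The abstract does not say which test is used, so this example assumes Wald's sequential probability ratio test (SPRT) for a mean shift in Gaussian residuals; the function name `sprt_alarms` and the parameters `shift`, `sigma`, `alpha` and `beta` are hypothetical, and the state estimator that would produce the residuals is not modelled.

```python
import math
from typing import Iterable, List


def sprt_alarms(residuals: Iterable[float], shift: float, sigma: float,
                alpha: float = 0.01, beta: float = 0.01) -> List[int]:
    """Return the sample indices at which the SPRT accepts the fault hypothesis.

    H0: residual ~ N(0, sigma^2)      (estimate and measurement agree)
    H1: residual ~ N(shift, sigma^2)  (assumed fault signature)
    alpha and beta are the tolerated false-alarm and missed-alarm rates.
    """
    upper = math.log((1 - beta) / alpha)   # crossing this accepts H1
    lower = math.log(beta / (1 - alpha))   # crossing this accepts H0
    llr = 0.0
    alarms: List[int] = []
    for k, r in enumerate(residuals):
        # Log-likelihood ratio increment for a Gaussian mean shift.
        llr += (shift / sigma ** 2) * (r - shift / 2)
        if llr >= upper:
            alarms.append(k)
            llr = 0.0                      # restart the test after an alarm
        elif llr <= lower:
            llr = 0.0                      # H0 accepted; restart the test
    return alarms


# Noise-free illustration: 20 healthy samples, then a sustained offset of 1.0.
residual_trace = [0.0] * 20 + [1.0] * 10
alarm_indices = sprt_alarms(residual_trace, shift=1.0, sigma=1.0)
```

Because the test accumulates evidence over several samples, a single noisy residual does not raise an alarm, which is one reason sequential tests suit noisy sensor signals.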
GPGPUs are increasingly being used as performance accelerators for HPC (High Performance Computing) applications in CPU/GPU heterogeneous computing systems, including TianHe-1A, the world's fastest supercomputer in the TOP500 list, built at NUDT (National University of Defense Technology) last year. However, despite their performance advantages, GPGPUs do not provide built-in fault-tolerant mechanisms to offer the reliability guarantees required by many HPC applications. By analyzing the SIMT (single-instruction, multiple-thread) characteristics of programs running on GPGPUs, we have developed PartialRC, a new checkpoint-based compiler-directed partial recomputing method, for achieving efficient fault recovery by leveraging the phenomenal computing power of GPGPUs. In this paper, we introdu...
minimize stress limitations; wire materials and coatings for high temperature .... Fault tolerance is another effort to prevent crash-down events. .... Dellacorte, C, Valco, M.J., “Load Capacity Estimation of Foil Air Journal Bearings for Oil-Free ...
typically provide a greater degree of robustness or fault tolerance than the von Neumann sequential ... The learning schemes of a Learning Fuzzy Logic Con- ...... Hopfield Neural Nets for Pattern Classification, MIT Lincoln Laboratory Techni- ...
Fault-tolerant motion of redundant manipulators can be obtained by joint velocity reconfiguration. For fault-tolerant manipulators, it is beneficial to determine the configurations that can tolerate the locked-joint failures with a minimum relative joint velocity jump, because the manipulator can rapidly reconfigure itself to tolerate the fault. This paper uses the properties of the condition numbers to introduce those optimal configurations for serial manipulators. The relationship between the manipulator's locked-joint failures and the condition number of the Jacobian matrix is indicated by using a matrix perturbation methodology. Then, it is observed that the condition number provides an upper bound of the required relative joint velocity change for recovering the faults which leads to ...
Aug 31, 2011 ... I am currently in the US Navy as a nuclear propulsion officer candidate, ... My interests include robotics, artificial intelligence, and fault-tolerant controls. ... Map with a Back-Propagation Neural Network in order to detect and ...
With respect to scalability and arbitrary topologies of the underlying networks in multiprogramming and multithread environments, fault tolerance in acknowledged ATAB and concurrent communications becomes a challenge for reliable general wormhole-routing multicomputers with arbitrary topologies. In th...
With respect to scalability and arbitrary topologies of the underlying networks in multiprogramming and multithread environments, fault tolerance in acknowledged ATAB and concurrent communications becomes a challenge for reliable general wormhole-routing multicomputers with arbitrary topologies. In this ...
Case Study of the UARS and ERBS End-of-Mission Plans ..... support components including the transmitters and power amplifier remain single-fault tolerant. ... tanks indicated that a failure of the tank bladder had occurred and that the thrusters ...
One important issue in the motion planning of a kinematically redundant manipulator is fault tolerance. In general, if the motion planner is fault-tolerant, the manipulator can achieve the required path of the end-effector even when one of its joints fails. In this situation, the contribution of the faulty joint to the end-effector must be compensated by the healthy joints to maintain the prescribed end-effector trajectory. To achieve this, this paper proposes a fault-tolerant motion planning scheme by adding a simple fault-tolerant equality constraint for the faulty joint. Such a scheme is then unified into a quadratic program (QP), which incorporates joint-physical constraints such as joint limits and joint-velocity limits. In addition, a numerical computing solver based on linear variatio...
Abstract: Fault-tolerant control is considered for a nonlinear aircraft model expressed as a ... Reconfigurability tj2j is calculated for this model with respect to the loss of control effectiveness ... is a linear time-varying process whose state-space ...
We present and analyze protocols for fault-tolerant quantum computing using color codes. We present circuit-level schemes for extracting the error syndrome of these codes fault-tolerantly. We further present an integer-program-based decoding algorithm for identifying the most likely error given the syndrome. We simulated our syndrome extraction and decoding algorithms against three physically-motivated noise models using Monte Carlo methods, and used the simulations to estimate the corresponding accuracy thresholds for fault-tolerant quantum error correction. We also used a self-avoiding walk analysis to lower-bound the accuracy threshold for two of these noise models. We present and analyze two architectures for fault-tolerantly computing with these codes: one in which 2D arrays of qubits are stacked atop each other and one in a single 2D substrate. Our analysis demonstrates that color codes perform slightly better than Kitaev's surface codes when circuit details are ignored. When these details are considered, w...
...tanks under a lightning strike event. The existing...able to tolerate lightning current without...that provide the locations of additional...fault current or lightning strike, which could...regulations/ibr_locations.html....
The gas turbine industry has a continued interest in improving engine ... The fault-tolerant magnetic bearing test facility was upgraded to operate to .... to create an ultraefficient, clean, intelligent, versatile, and durable gas turbine engine.
Spatially overlapping plates in tiled configurations represent designs that are observed widely in nature (e.g., fish and snake scales) and man-made systems (e.g., shingled roofs) alike. This imbricate architecture offers fault-tolerant, multifunctional capabilities, in layouts that can provide mechanical flexibility even with full, 100% areal coverage of rigid plates. Here, the realization of such designs in microsystems technologies is presented, using a manufacturing approach that exploits strategies for deterministic materials assembly based on advanced forms of transfer printing. The architectures include heterogeneous combinations of silicon, photonic, and plasmonic scales, in imbricate layouts, anchored at their centers or edges to underlying substrates, ranging from elast...
Battle Management and Command, Control and Communications (BM/C3) issues in the Strategic Defense Initiative (SDI) context were discussed at a two-day workshop at the Institute for Defense Analyses (IDA). Another workshop probed civilian systems that require handling of large amounts of data and that have fault-tolerant features that may be useful in resolving the SDI BM/C3 problems. The findings are summarized in this report, with special emphasis on possible research to be supported by the Innovative Science and Technology Office of SDIO.
This paper addresses issues central to the design and operation of an ultrareliable, Byzantine resilient parallel computer. Interprocessor connectivity requirements are met by treating connectivity as a resource that is shared among many processing elements, allowing flexibility in their configuration and reducing complexity. Redundant groups are synchronized solely by message transmissions and receptions, which also provide input data consistency and output voting. Reliability analysis results are presented that demonstrate the reduced failure probability of such a system. Performance analysis results are presented that quantify the temporal overhead involved in executing such fault-tolerance-specific operations. Empirical performance measurements of prototypes of the architecture are presented. 30 refs.
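Output voting across a redundant group, as mentioned above, can be illustrated with a minimal exact-match majority voter. This is only a sketch of the voting step: it assumes every replica's output has already been delivered, and it does not model the message-exchange rounds that Byzantine resilience requires. The function name `vote` is hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of replicas.

    Raises RuntimeError when no value has a strict majority, so the caller
    can treat the redundant group as failed rather than pick arbitrarily.
    """
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise RuntimeError("no strict majority among replica outputs")
    return value


# A triplex group masks one faulty replica.
agreed = vote([42, 42, 7])
```

Raising an error on disagreement, instead of returning the most common value, keeps a split vote from being silently passed downstream.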
Database replication is widely used for fault-tolerance, scalability and performance. The failure of one database replica does not stop the system from working as available replicas can take over the tasks of the failed replica. Scalability can be achieved by distributing the load across all replicas, and adding new replicas should the load increase. Finally, database replication can provide fast local access, even if clients are geographically distributed, if data copies are located close to clients. Despite its advantages, replication is not a straightforward technique to apply, and
Mobile agent computing is being used in fields as diverse as artificial intelligence, computational economics and robotics. Agents' ability to adapt dynamically and execute asynchronously and autonomously brings potential advantages in terms of fault-tolerance, flexibility and simplicity. This monograph focuses on studying mobile agents as modelled in distributed systems research and in particular within the framework of research performed in the distributed algorithms community. It studies the fundamental question of how to achieve rendezvous, the gathering of two or more agents at the
During the six month tenure of the grant, activities included continued research of hydrostatic bearings as a viable backup-bearing solution for a magnetically levitated shaft system in extreme temperature environments (1000 F), developmental upgrades of the fault-tolerant magnetic bearing rig at the NASA Glenn Research Center, and assisting in the development of a conical magnetic bearing for extreme temperature environments, particularly turbomachinery. It leveraged work from the ongoing Smart Efficient Components (SEC) and the Turbine-Based Combined Cycle (TBCC) program at NASA Glenn Research Center. The effort was useful in providing technology for more efficient and powerful gas turbine engines.
Quantum computers can (in theory) solve certain problems far faster than a classical computer running any known classical algorithm. While existing technologies for building quantum computers are in their infancy, it is not too early to consider their scalability and reliability in the context of the design of large-scale quantum computers. To architect such systems, one must understand what it takes to design and model a balanced, fault-tolerant quantum computer architecture. The goal of this lecture is to provide architectural abstractions for the design of a quantum computer and to explore
Summary form only given, substantially as follows. The authors describe the structure of special-purpose multiprocessor computers needed for the analysis and control of networks. The processors are built as separate, uniform computers. These processors are joined in a structure similar to that of the networks under study while a problem is being solved. The principles of operation of the all-digital communication-and-switching system between the processors permit quick changes of the configuration to suit the problem. This is the basis for building survivability into the computer. The parallel special-purpose computer can survive various component failures. These computers are quite effective for the control of different fault-tolerant networks and objects.
Currently most distributed telecoms software is engineered using low- and mid-level distributed technologies, but there is a drive to use high-level distribution. This paper reports the first systematic comparison of a high-level distributed programming language in the context of substantial commercial products. Our research strategy is to reengineer some C++/CORBA telecoms applications in ERLANG, a high-level distributed language, and make comparative measurements. Investigating the potential advantages of the high-level ERLANG technology shows that two significant benefits are realized. Firstly, robust configurable systems are easily developed using the high-level constructs for fault tolerance and distribution. The ERLANG code exhibits resilience: sustaining throughput at extreme loads ...
We are developing software to explore the fault tolerance of quantum dot cellular automata gate architectures in the presence of manufacturing variations and device defects. The Topology Optimization Methodology using Applied Statistics (TOMAS) framework extends the capabilities of A Quantum Interconnected Network Array Simulator (AQUINAS) by adding front-end and back-end software and creating an environment that integrates all of these components. The front-end tools establish all simulation parameters, configure the simulation system, automate the Monte Carlo generation of simulation files, and execute the simulation of these files. The back-end tools perform automated data parsing, statistical analysis and report generation.
This paper proposes a new driving scheme for insulated gate bipolar transistors (IGBTs) and thyristors used for high power conversion. Most power conversion techniques are based on switching actions, so gate driving schemes and their related circuits play important roles in power conversion. In this paper, fault-tolerant gate driving schemes for power switches, and a power supply for them that utilizes stored energy in the system, are presented. Experiments have been carried out with 6500V-rated IGBTs and thyristors to verify the validity of the proposed driving scheme.
In this paper, we present a new protocol for optimistic rollback recovery in distributed systems. This protocol is completely asynchronous, minimizes rollback, and is independent of any particular underlying distributed computation to be made fault-tolerant. This protocol improves on earlier work in asynchronous optimistic rollback recovery in that previous protocols either sacrificed some of these properties or required larger timestamps. Furthermore, we establish that this protocol is optimal, in that no rollback recovery protocol can achieve these properties and have asymptotically smaller timestamps.
Faults in complex tectonic environments interact in various ways, including triggered rupture of one fault by another, that may increase seismic hazard in the surrounding region. We model static and dynamic fault interactions between the strike-slip and thrust fault systems in southern California. We find that rupture of the Sierra Madre-Cucamonga thrust fault system is unlikely to trigger rupture of the San Andreas or San Jacinto strike-slip faults. However, a large northern San Jacinto fault earthquake could trigger a cascading rupture of the Sierra Madre-Cucamonga system, potentially causing a moment magnitude 7.5 to 7.8 earthquake on the edge of the Los Angeles metropolitan region. PMID:14671298
The main purpose of this work has been to achieve active fault-tolerance in control systems, defined as a methodology where fault detection and isolation techniques are combined with supervisory control to achieve autonomous accommodation of faults before they develop into failures. The aim of this work has been to develop and employ concepts and methods that are suitable for use in different automation processes, with applicability in various industrial fields. The requirements for high productivity and quality have resulted in the use of additional instrumentation and more sophisticated control algorithms. The drawback is, however, that these control systems have become more vulnerable to even simple faults in instrumentation. On the other hand, due to cost-optimality requirements, extensive use of hardware redundancy has been ruled out. Nevertheless, dependability and availability could be increased by enhancing the control system's ability to perform on-line fault detection and reconfiguration when a fault occurs and before a safety system shuts down the entire process. The main contributions of this research effort are the development of, and experimentation with, methodologies for systematic analysis of reconfiguration and design of supervisor logic. In addition, useful experience is obtained through implementation of a fault-tolerant control scheme against a simulated ship and its propulsion system. A development methodology, which was suggested in the Control Engineering Department, is extended to cope with the important reconfiguration problem. In order to enable a designer to acquire knowledge about reconfiguration possibilities, the structural analysis method is added as an extension to the existing methodology. This extension builds upon the earlier method where fault propagation and severity analysis are the essential parts. 
Structural analysis (SA) enables the designer to distinguish between the parts of the system with no redundant information and the parts with possible redundant information. This method, hence, provides the designer with information which is necessary during the selection of remedial actions. Furthermore, it is shown how sensor information fusion is obtained by using the SA method. The construction of the supervisor's decision logic is essential for the active form of fault-tolerant control. In this regard, two approaches have been presented. The first aims at constructing the decision logic in the form of a "language". This language is obtained as a direct result of the component-based approach presented in this thesis. This approach is based on the definition of a functional component, component placement in a control system hierarchy and the definition of system level hierarchy. The supervisor language includes all valid strings, representing the combination of valid components, that keep the system functional. This approach is simple and can be automated. In the second approach, implementation of supervisor functionality is realized on the basis of an extension to the traditional state-event machines. Due to parallelism (inherent modularity) the supervisor logic is more easily modified, updated, maintained, and tested. A salient feature is that a change in one task only necessitates redesign of essentially one corresponding state-event machine (SEM). A heuristic guideline is provided for designing the logic in the form of SEMs. A ship propulsion system benchmark has been designed and used as a case study. This includes experimentation with the above methodologies and implementation of fault-tolerant control against the simulation. Four generic faults have been considered. It has been shown how the SA method is easily employed to generate analytical redundancy relations, which in turn are then used for FDI purposes. 
Three different methods are used to generate residuals. These methods are: simple numerical calculation, a non-linear observer, and a Neuro-Fuzzy method. The choice of method depends on the assumptions made about the available system information. The results show that it is possible to detect and identify the faults. The results emphasize the significance of acquiring detailed knowledge about the system behavior during non-faulty as well as faulty situations. Using the structural model of the system, it is illustrated how to perform sensor information fusion when a sensor fault occurs. It is finally shown how the decision logic of the supervisor for the benchmark is designed. Main results in this dissertation have been presented at international conferences and parts of the dissertation have also been published in international journals.
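The state-event-machine (SEM) view of supervisor logic described in this abstract can be illustrated with a very small sketch. It does not reproduce the thesis's extended SEM formalism: the states, events and transition table below are hypothetical names chosen for illustration, and only a single machine is shown, whereas the thesis describes several running in parallel.

```python
from typing import Dict, Tuple

# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("nominal", "fault_detected"): "isolating",
    ("isolating", "fault_isolated"): "reconfiguring",
    ("reconfiguring", "reconfigured"): "degraded",
    ("degraded", "fault_cleared"): "nominal",
}


class StateEventMachine:
    """Minimal table-driven machine for one supervisor task."""

    def __init__(self, transitions: Dict[Tuple[str, str], str], initial: str) -> None:
        self.transitions = transitions
        self.state = initial

    def fire(self, event: str) -> str:
        """Apply an event; events with no entry for the current state are ignored."""
        self.state = self.transitions.get((self.state, event), self.state)
        return self.state


supervisor = StateEventMachine(TRANSITIONS, "nominal")
for event in ("fault_detected", "fault_isolated", "reconfigured"):
    supervisor.fire(event)
# supervisor.state is now "degraded"
```

Keeping the logic in a table means that changing one task amounts to editing one table, which mirrors the modularity argument made in the abstract.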
The Hokkaido Electric Power Company configured a centralized monitoring automation system of the function-distribution type together with the automation system of its Nayoro Power Station, and started operating it in March 1990. This report outlines the system. In a function-distribution system, processors are divided and isolated by function, the signal input and output devices and other terminal devices are made intelligent, and these devices are combined systematically over a high-speed, highly reliable LAN. The system management device, the monitoring operation processor, the record collection processor, the automatic control processor, the data relay device, the system monitoring panel device, the CRT device, the printer device, the MSG transmission device (which exchanges various information with upper systems) and 1:NTC (the master station) are connected by highway buses. The system offers effective fault tolerance, processing capacity at high loads, responsiveness, and functional extensibility. 8 figs., 2 tabs.
A study was performed to develop a fluid system design and show the feasibility of constructing an integrated modular engine (IME) configuration using an expander cycle engine. The primary design goal of the IME configuration was to improve propulsion system reliability. The IME fluid system was designed as a single-fault-tolerant system while minimizing the required fluid components. This study addresses the design of the high-pressure manifolds, turbopumps and thrust chambers for the IME configuration. A physical layout drawing was made, which located each of the fluid system components, manifolds and thrust chambers. Finally, a comparison was made between the fluid system designs of an IME system and a non-network (clustered) engine system.
FT-GReLoSSS (FTG) is a C++/MPI framework to ease the development of fault-tolerant parallel applications belonging to a SPMD family termed GReLoSSS. The originality of FTG is to rely on the MoLOToF programming model principles to facilitate the addition of an efficient checkpoint-based fault toleran...
In this paper, the fault-tolerant control problem is addressed in a networked framework. An isolation filter together with a fault compensation mechanism is proposed for FDI/FTC. Several design procedures are studied. First, in the centralized architecture, the inputs and outputs information used f...
This study is part of the development of a fault-tolerance function for an electro-mechanical wedge brake actuator at Robert Bosch GmbH. Through the Fault Mode and Effect Analysis (FMEA) on the present brake actuator design, the AMR angle sensor (AMR: anisotropic magnetic resistance) is identified...
Advanced control schemes can be used to optimize energy production and cost of energy in modern wind turbines. These control schemes most often rely on wind speed estimations. These wind speed estimators are, however, not designed to be fault-tolerant towards faults in the sensors used. I...
In this report an actuator fault-tolerant control (FTC) strategy based on set separation is presented. The proposed scheme employs a bank of observers which match the different fault situations that can occur in the plant. Each of these observers has an associated estimation error with a distinctive...
In this paper the authors investigate the robustness of artificial neural networks when encountering transient modification of information bits related to the network operation. These kinds of faults are likely to occur as a consequence of interaction with radiation. Results of tests performed to evaluate the fault-tolerance properties of two different digital neural circuits are presented.
In this paper, an actuator fault-tolerant control (FTC) strategy based on set separation is presented. The proposed scheme employs a standard configuration consisting of a bank of observers which match the different fault situations that can occur in the plant. Each of these observers has an associa...
A new technology gyro, the hemispherical resonator gyro (HRG), has been developed which is ideally suited to strapdown navigation systems. The instrument offers greater reliability than current sensors due to its inherent simplicity. A fault-tolerant navigation system has been developed as a technology testbed to exhibit the instrument's capabilities. A successful flight test program of the system was completed in late 1989. This fault-tolerant navigation system is configured with six independent sensor channels consisting of paired gyros and accelerometers. Each sensor channel utilizes a microprocessor, programmed in C, to control sensor readout and compensate for thermal errors. A central navigation processor, programmed in Ada, combines data from the independent sensor channels to provide redundancy management and an optimized navigation solution. A dodecahedron mounting structure for the instrument cluster provides an optimum alignment of the redundant inertial axes. Redundancy management of the sensors is accomplished by an enhanced generalized likelihood test. Key features demonstrated for the HRG navigator are rapid reaction, operation over a wide temperature range, and navigation accuracy.
Swarm robotics is an example of a complex system with interactions among distributed autonomous robots as well as with the environment. Within the swarm there is no centralised control; behaviour emerges from interactions between agents within the swarm. Agents within the swarm exhibit time-varying behaviour in dynamic environments, and are subject to a variety of possible anomalies. The focus of our work is on specific faults in individual robots that can affect the global performance of the robotic swarm. We argue that classical approaches for achieving tolerance through implicit redundancy are insufficient in some cases and additional measures should be explored. Our contribution is to demonstrate that tolerance through explicit detection with statistical techniques works well and is su...
The TDAQ system requires a comprehensive and flexible control system. Its role ranges from the so-called run control, e.g. starting and stopping the data taking, to error handling and fault tolerance. It also includes initialization and verification of the overall system. Following the traditional approach, a hierarchical system of customizable controllers has been proposed. For the final system, all functionality will therefore be available in a distributed manner, with the possibility of local customization. After a technology survey, the open-source expert system CLIPS has been chosen as a basis for the implementation of the supervision and the verification system. The CLIPS interpreter has been extended to provide a general control framework. Other ATLAS Online software components have been integrated as plug-ins and provide the mechanism for configuration and communication. Several components have been implemented sharing this technology. The dynamic behavior of the individual component is fully described by th...
In control systems, identifying the system inverse is an important problem. Identification of the system inverse with neural networks is investigated in this dissertation. The system inverse model implemented with a neural network is used as a feedforward controller. An inverse transfer matrix learning scheme is developed. The neural learning control system has learning capability, and compared to a specialized learning scheme and to the feedback error learning scheme, the inverse transfer matrix scheme provides faster convergence. The experience obtained during learning can be used for the execution of quite different tasks. The neural control system also has a massively parallel distributed processing capability and is inherently fault-tolerant. The generality of the neural control system is demonstrated by successful applications to a pole-balancing control task and to a robot arm motion control task.
Project EOS is studying the problems of building adaptable real-time embedded operating systems for the scientific missions of NASA. Choices (A Class Hierarchical Open Interface for Custom Embedded Systems) is an operating system designed and built by Project EOS to address the following specific issues: the software architecture for adaptable embedded parallel operating systems, the achievement of high-performance and real-time operation, the simplification of interprocess communications, the isolation of operating system mechanisms from one another, and the separation of mechanisms from policy decisions. Choices is written in C++ and runs on a ten-processor Encore Multimax. The system is intended for use in constructing specialized computer applications and for research on advanced operating system features, including fault tolerance and parallelism.
Modern high-performance aircraft demand advanced fault-tolerant flight control strategies. Not only control effector failures but also aerodynamic failures such as wing-body damage often result in substantially deteriorated performance because of low available redundancy. As a result, the remaining control actuators may yield substantially lower maneuvering capabilities, which do not permit the accomplishment of the aircraft's originally specified mission. The problem is to solve the control reconfiguration on the available control redundancies when a mission modification is needed to save the aircraft. The proposed robust adaptive fault-tolerant control (RAFTC) system consists of a multi-layer reconfigurable flight controller architecture. It contains three layers accounting for different types and levels of failures, including sensor, actuator, and fuselage damage. In nominal operation with possible minor failure(s), a standard adaptive controller achieves the control allocation. This is referred to as the first layer, the controller layer. The performance adjustment is accounted for in the second layer, the reference layer, whose role is to adjust the reference model in the controller design with a degraded transient performance. The uppermost mission adjustment is made in the third layer, the mission layer, when the original mission is not feasible with greatly restricted control capabilities. The modified mission is achieved through the optimization of the command signal, which guarantees the boundedness of the closed-loop signals. The main distinguishing feature of this layer is the mission decision property based on the currently available resources. The contribution of the research is the multi-layer fault-tolerant architecture that can address the complete failure scenarios and their accommodation in practice.
Moreover, the emphasis is on the mission design capabilities which may guarantee the stability of the aircraft with restricted post-failure control capabilities. The implementation issues of the architecture are also addressed, with possible realizations and the feasibility analysis.
Providing fault tolerance in high-end petascale systems, consisting of millions of hardware components and complex software stacks, is becoming an increasingly challenging task. Checkpointing continues to be the most prevalent technique for providing fault tolerance in such high-end systems. Considerable research has focused on optimizing checkpointing; however, in practice, checkpointing still involves a high-cost overhead for users. In this paper, we study the checkpointing overhead seen by various applications running on leadership-class machines like the IBM Blue Gene/P at Argonne National Laboratory. In addition to studying popular applications, we design a methodology to help users understand and intelligently choose an optimal checkpointing frequency to reduce the overall checkpointing overhead incurred. In particular, we study the Grid-Based Projector-Augmented Wave application, the Carr-Parrinello Molecular Dynamics application, the Nek5000 computational fluid dynamics application and the Parallel Ocean Program application, and analyze their memory usage and possible checkpointing trends on 65,536 processors of the Blue Gene/P system.
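The trade-off behind choosing a checkpointing frequency can be illustrated with a small sketch. It does not reproduce the methodology of the abstract above; it uses Young's well-known first-order approximation, in which the interval is the square root of twice the product of the checkpoint cost and the mean time between failures. The function names and the numbers in the usage note are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval (seconds).

    Assumption: this is a textbook formula swapped in for illustration,
    not the methodology of the paper summarized above.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order fraction of run time lost.

    The first term is the time spent writing checkpoints; the second is the
    expected rework after a failure (on average half an interval is lost).
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)
```

With a hypothetical checkpoint cost of 50 s and an MTBF of 10,000 s, `young_interval` returns 1,000 s, and `overhead_fraction` is lower at that interval than at half or twice that interval, which is the behavior a user tuning the frequency would look for.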
We present an economical and fault-tolerant load balancing strategy (EFTLBS) based on an operator replication mechanism and a load shedding method, that fully utilizes the network resources to realize continuous and highly-available data stream processing without dynamic operator migration over wide area networks. In this paper, we first design an economical operator distribution (EOD) plan based on a bin-packing model under the constraints of each stream bandwidth as well as each server's CPU capacity. Next, we devise super-operator (SO) that load balances multi-degree operator replicas. Moreover, for improving the fault-tolerance of the system, we color the SOs based on a coloring bin-packing (CBP) model that assigns peer operator replicas to different servers. To minimize the effects of input rate bursts upon the system, we take advantage of a load shedding method while keeping the QoS guarantees made by the system based on the SO scheme and the CBP model. Finally, we substantiate the utility of our work through experiments on ns-3.
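The placement idea in this abstract, packing operator replicas onto servers while keeping peer replicas apart, can be sketched as a first-fit-decreasing bin packing with an anti-affinity rule. This is a simplified illustration under stated assumptions, not the EOD or CBP model itself: only CPU capacity is modeled (stream bandwidth is omitted), and the `Server` class and `place_replicas` function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    cpu_capacity: float
    cpu_used: float = 0.0
    operators: set = field(default_factory=set)  # logical operator ids hosted here


def place_replicas(replicas, servers):
    """Place (operator_id, cpu_demand) replicas on servers.

    First-fit decreasing by CPU demand, with one extra rule: two replicas of
    the same operator never share a server, so a single server failure cannot
    remove every copy of an operator.
    Returns {(operator_id, replica_index): server_name}.
    """
    placement = {}
    next_index = {}
    for op_id, demand in sorted(replicas, key=lambda r: -r[1]):
        idx = next_index.get(op_id, 0)
        for server in servers:
            fits = server.cpu_used + demand <= server.cpu_capacity
            if fits and op_id not in server.operators:
                server.cpu_used += demand
                server.operators.add(op_id)
                placement[(op_id, idx)] = server.name
                next_index[op_id] = idx + 1
                break
        else:
            raise RuntimeError(f"no feasible server for a replica of {op_id}")
    return placement
```

For example, two servers of capacity 1.0 and two replicas each of a hypothetical `join` (0.5) and `filter` (0.4) operator end up with one replica of each operator per server.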
Peer-to-peer (P2P) systems have proven to be a good solution for building self-organized, large-scale distributed information systems. Within peer-to-peer overlays, Distributed Hash Tables (DHTs) offer strong fault-tolerance properties as well as efficient object look-up algorithms. Even if the essence of a P2P system is to provide each node with a restricted local view of the whole system, it is often useful to have some global knowledge of the system, such as the DHT size. The DHT size is relevant, as this information is used to adapt some parameters of the system (the number of replicas of a given object, the depth for epidemic algorithms, the potential number of clients for a given service). Computing the number of nodes in a DHT is a difficult task as it requires the best possible acc...
Various papers on control are presented. The general topics considered include: simulation and computational methods; linear systems and control; control of flexible structures; intelligent control systems; industrial control systems; computer-aided control engineering; robust adaptive control; frequency-domain methods; filtering, estimation, and tracking; optimization of discrete event systems; trajectory control of robot manipulators; digital signal processing in process control; control of batch processes; robustness of state space models; stable factorization; aircraft and spacecraft guidance; model order reduction; computer networking of real-time control; and advances in automatic control education. Also addressed are: implementation of adaptive and self-tuning controls in machining, eigenvalue/eigenstructure assignment, robust nonlinear control of manipulators, redundant robot control, fault detection, ACES control theory and verification, decentralized control, damage-tolerant flight control systems, neural networks in control, distributed parameter and time-delay systems, and robust stabilization and control.
In this paper, we mix two well-known approaches to fault tolerance: robustness and stabilization. Robustness is the aptitude of an algorithm to withstand permanent failures such as process crashes. Stabilization is a general technique to design algorithms tolerating transient failures. Using...
There are numerous faults of various kinds in the Dongpu depression, which are closely related to oil and gas resources; therefore, fault interpretation is of great importance in structural research. Fault interpretation needs to consider not only the pattern of the fault system but also the interpretation skill. In terms of sectional features and lateral distribution, the faults in the Dongpu depression can be grouped under four heads: graben fault system, antithetic fault system, synthetic fault system and terrace fault system. The essential skill for fault interpretation includes two points: good identification of faulting points on the seismic section, and reasonable linking of relevant faulting points on the base map. The key measures the authors must take in linking relevant faulting points on the base map are (1) full consideration of the areal geological structure trend (the major structure trend being considered first), (2) integrated analysis of base map and sections, (3) application of the fault cutting principle and (4) reference to cross-line sections. In addition, drilling data should be used to make a good fault interpretation.
Historically, large-scale safeguards alarm and communication systems have required the expensive computational power of a mainframe or midsize computer. Due to the widespread availability and reduced cost of PC-based technology, this class of machine is a much preferred solution. This paper will discuss a development program integrating this technology with inexpensive local area network (LAN) hardware to support (1) many touch-panel-based operator graphics consoles, (2) redundant LAN communications, (3) fault-tolerant LAN communication, (4) redundancy in subsystem failure, (5) modularity in design, (6) fault-tolerant video communication, (7) inexpensive PC-based video annotation and switcher design, (8) inexpensive video replay capability, (9) use of fiber optic communication media, (10) distributed parallel processing, and (11) minimized overall system cost. The Intel BitBus architecture was selected for network communications between PC CPUs. The network supports both fiber optic and copper media and ensures message integrity and receipt. Custom boards have been developed to transform PCs into modular, expandable routing switchers with video presence detection and annotation. 1 fig.
Power consumption on big clusters is an important component of the operating cost of the data center; in particular, it increases the total cost of ownership (TCO). The implementation and use of power optimization algorithms is therefore crucial for economical management of the data center. The power manager introduced in this work is based on an autonomically managed dynamic provisioning algorithm that reduces human interaction and enables the system to run using less energy and to respond to errors automatically. In addition, fault tolerance in big data centers is essential for optimizing their usage and for guaranteeing the coherence and consistency of the data and of the processing performed in the data center. In a cluster whose nodes are dynamically provisioned and whose topology is managed by a load-balancing element, it is possible to take advantage of this feature to replace nodes with issues when free nodes are available. This work presents a fault-tolerance system that integrates w...
This paper proposes a method for deriving formal specifications of systems. To accomplish this task we pass through a non-trivial number of steps, concepts and tools, of which the first and most important is the concept of method itself, since we realized that computer science has a proliferation of languages but very few methods. We also propose the idea of Layered Fault-Tolerant Specification (LFTS) to make the method extensible to dependable systems. The principle is to layer the specification, for the sake of clarity, in (at least) two different levels, the first one for the normal behavior and the others (if more than one) for the abnormal. The abnormal behavior is described in terms of an Error Injector (EI), which represents a model of the erroneous interference coming from the environment. This structure has been inspired by the notion of the idealized fault-tolerant component, but the combination of LFTS and EI using rely/guarantee thinking to describe interference can be considered one of the main contr...
The Network-on-Chip (NoC) is limited by the reliability constraint, which impels us to exploit the fault-tolerant routing. Generally, there are two main design objectives: tolerating more faults and achieving high network performance. To this end, we propose a new multiple-round dimension-order routing (NMR-DOR). Unlike existing solutions, besides the intermediate nodes inter virtual channels (VCs), some turn-legally intermediate nodes inside each VC are also utilized. Hence, more faults are tolerated by those new introduced intermediate nodes without adding extra VCs. Furthermore, unlike the previous solutions where some VCs are prioritized, the NMR-DOR provides a more flexible manner to evenly distribute packets among different VCs. With extensive simulations, we prove that the NMR-DOR maximally saves more than 90% unreachable node pairs blocked by faults in previous solutions, and significantly reduces the packet latency compared with existing solutions.
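The general idea of routing in more than one dimension-order round, where a packet first travels to an intermediate node and then to its destination, can be sketched on a 2D mesh as follows. This is a simplified illustration, not the NMR-DOR algorithm: it uses plain XY routing, tries at most one intermediate node, and ignores virtual channels and deadlock avoidance entirely. The function names and the mesh in the usage note are hypothetical.

```python
def xy_path(src, dst):
    """Nodes visited by XY dimension-order routing: move along X first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def route(src, dst, faulty, width, height):
    """Return a fault-free path from src to dst, or None if none is found.

    A single XY round is tried first. If it crosses a faulty node, every
    healthy node is tried as an intermediate: XY to the intermediate, then
    XY from the intermediate to the destination (two rounds).
    """
    def clear(path):
        return not any(node in faulty for node in path)

    direct = xy_path(src, dst)
    if clear(direct):
        return direct
    for ix in range(width):
        for iy in range(height):
            mid = (ix, iy)
            if mid in faulty or mid == src or mid == dst:
                continue
            first, second = xy_path(src, mid), xy_path(mid, dst)
            if clear(first) and clear(second):
                return first + second[1:]  # drop the duplicated intermediate node
    return None
```

On a 3x3 mesh with node (1, 0) faulty, a packet from (0, 0) to (2, 0) cannot use the direct XY path, but the two-round search finds a detour through (0, 1). In a real NoC each round would typically be assigned its own virtual channel to keep the routing deadlock-free; that aspect is outside this sketch.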
Understanding fault zone properties in different geological settings is important to better assess the development and propagation of faults. In addition, this allows better evaluation and permeability estimates of potential fault-related geothermal reservoirs. The Leinetalgraben fault system provides an outcrop analogue for many fault zones in the subsurface of the North German Basin. The Leinetalgraben is a N-S-trending graben structure, initiated in the Jurassic, in the south of Lower Saxony and as such part of the North German Basin. The fault system was reactivated and inverted during Alpine compression in the Tertiary. This complex geological situation was further affected by halotectonics. Therefore we can find different types of fault zones, that is, normal, reverse, strike-slip and oblique-slip faults, surrounding the major Leinetalgraben boundary faults. Here we present first results of structural geological field studies on the geometry and architecture of fault zones in the Leinetalgraben Fault System at outcrop scale. We measured the orientations and displacements of 17 m-scale fault zones in limestone (Muschelkalk) outcrops, the thicknesses of their fault cores and damage zones, as well as the fracture densities and geometric parameters of the fracture systems therein. We also analysed the effects of rock heterogeneities, particularly stiffness variations between layers (mechanical layering), on the propagation of natural fractures and fault zones. The analysed fault zones predominantly show similar orientations to the major fault zones they surround. Other faults are conjugate or perpendicular to the major fault zones. The direction of predominant joint strike corresponds to the orientation of the fault zones in the majority of cases. The mechanical layering of the limestone and marlstone stratification obviously has great effects on fracture propagation.
Even thin layers (mm- to cm-scale) of low stiffness, here marl, seem to suffice to change the local stress field so that it stops many joints. Well-developed fracture networks are therefore in most cases limited to single layers. From the data we finally determined the structural indices of the fault zones, that is, the ratios of damage zone and fault zone widths. By their nature structural indices can take values from 0 to 1, the values having implications for fault zone permeability. An ideal value of 0 would mean that a fault damage zone is absent. Such fault zones generally have low permeabilities as long as the faults are not active (slipping). A structural index of 1, however, would imply that there is practically no fault core and the fault zone permeability is entirely controlled by the fractures within the damage zone. Our measurements show that the damage zones of normal faults in the Muschelkalk limestone are relatively thick so that their structural indices are relatively high. In contrast to normal faults, reverse and strike-slip faults have smaller indices because of well-developed brecciated fault cores. In addition, we found that small-scale fault zones with parallel orientations to the major Leinetalgraben fault zones are more likely to have well-developed damage zones than those with conjugate or perpendicular orientation. Our field data lead to the hypothesis that fault systems in the North German Basin may generally be surrounded by small-scale fault zones which have high permeabilities if orientated parallel to the major fault and lower permeabilities if conjugate or perpendicularly orientated. However, further studies of fault systems in different geological settings are needed to support or reject this hypothesis. Such studies help to improve the general understanding of fault zones and fault systems and thereby minimise the risk in the exploitation of fault-related geothermal reservoirs.
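The structural index described above can be written out as a short calculation. The sketch assumes that the fault zone width is the sum of the fault core width and the damage zone width, which is consistent with the stated limits (0 when the damage zone is absent, 1 when there is practically no fault core); the function name and the sample widths are hypothetical, not field data from the study.

```python
def structural_index(damage_zone_width_m: float, fault_core_width_m: float) -> float:
    """Ratio of damage zone width to total fault zone width, in the range 0 to 1.

    Assumption: total fault zone width = fault core width + damage zone width.
    """
    if damage_zone_width_m < 0 or fault_core_width_m < 0:
        raise ValueError("widths must be non-negative")
    total = damage_zone_width_m + fault_core_width_m
    if total == 0:
        raise ValueError("fault zone has zero width")
    return damage_zone_width_m / total
```

A zone with a 3 m damage zone and a 1 m core gives an index of 0.75, towards the fracture-controlled end of the range; a zone with no damage zone gives 0.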
Imaging of the sea floor offshore from Hartland Point (north Devon, U.K.), using high resolution multibeam bathymetry, reveals a strike-slip fault network. This consists of NE-trending left-lateral faults and NW-trending right-lateral faults that cut folded and steeply dipping strata (~ 60°). Faults were accurately mapped using the multibeam imagery, and lateral separations of marker beds measured along fault traces. These data are used to examine the spatial arrangement, fault displacement, and strain distribution within the network at different displacement cut-offs. At high displacement cut-offs, the fault network is dominated by a few long isolated right-lateral fault segments that bound fault blocks, but at lower displacement cut-offs shorter left-lateral and right-lateral fault segments make up fault tips and infill fault blocks. The majority (70%) of fault trace-length is taken up by small fault segments that have fault segments with > 10 m displacement. The topology and relative connectivity of the network is analysed in terms of a system of fault branches between tips (I-nodes) or intersections (X or Y-nodes), the relative proportions of which reflect the connectivity of the network. Although the kinematic behaviour of the fault network is controlled by large fault segments, connectivity is very dependent on the small fault segments. A comparison with a similar, nearby, strike-slip fault network at Westward Ho! (north Devon) shows many similarities and indicates that fault networks are better connected with increasing strain and that the network becomes better connected when strain is localized within damage zones rather than on individual faults.
Field programmable gate arrays (FPGAs) are widely used in reliability-critical systems due to their reconfiguration ability. However, with the shrinking device feature size and increasing die area, nowadays FPGAs can be deeply affected by the errors induced by electromigration and radiation. To improve the reliability of FPGA-based reconfigurable systems, a permanent fault recovery approach using a domain partition model is proposed in this paper. In the proposed approach, the fault-tolerant FPGA recovery from faults is realized by reloading a proper configuration from a pool of multiple alternative configurations with overlaps. The overlaps are presented as a set of vectors in the domain partition model. To enhance the reliability, a technical procedure is also presented in which the set of vectors are heuristically filtered so that the corresponding small overlaps can be merged into big ones. Experimental results are provided to demonstrate the effectiveness of the proposed approach through applying it to several benchmark circuits. Compared with previous approaches, the proposed approach increased MTTF by up to 18.87%.
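The recovery step described above, reloading a configuration from a pool of alternatives after a permanent fault, can be sketched as a set-disjointness check. This illustration does not reproduce the domain partition model or the heuristic merging of overlaps; it only assumes that each alternative configuration is known by the set of FPGA resources it occupies. The function name, configuration names and resource ids are hypothetical.

```python
def pick_configuration(pool, faulty_resources):
    """Return the name of the first configuration that avoids every faulty resource.

    pool maps a configuration name to the set of resource ids it uses.
    Returns None when every alternative touches at least one faulty resource.
    """
    for name, used_resources in pool.items():
        if used_resources.isdisjoint(faulty_resources):
            return name
    return None
```

With a pool of three configurations using resources {1, 2, 3}, {3, 4, 5} and {5, 6, 7}, a fault in resource 2 selects the second configuration, and a fault in resource 3 selects the third. The more the alternatives overlap in the resources they use, the fewer faults the pool can recover from.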
In this research, we first define protocol subsets for the SMART (System-integrated Modular Advanced Reactor) communication network based on the SMART MMIS transmission delay and traffic requirements and the network protocol functions of the seven OSI (Open System Interconnection) layers. Current industrial LAN protocols are also analyzed, and the applicability of commercial protocols is checked. For the suitability test, we applied approximated SMART data traffic and the maximum allowable transmission delay requirement. From the simulation results, we conclude that IEEE 802.5 and FDDI, which is an ANSI standard, are the most suitable for SMART. We further analyzed the FDDI and token ring protocols for SMART and the nuclear plant network environment, including IEEE 802.4, IEEE 802.5, and ARCnet. The most suitable protocol for SMART is FDDI. The FDDI MAC and RMT protocol specifications have been verified with LOTOS, and the verification results show that FDDI MAC and RMT satisfy reachability and liveness and exhibit neither deadlock nor livelock. Therefore, we conclude that FDDI MAC and RMT are highly reliable protocols for the SMART MMIS network. After that, we consider the stacking fault of the IEEE 802.5 token ring protocol and propose a fault-tolerant MAM (Modified Active Monitor) protocol. The simulation results show that the MAM protocol improves the lower-priority traffic service rate when a stacking fault occurs. Therefore, the proposed MAM protocol can be applied to the SMART communication network for high-reliability, hard real-time communication in data acquisition and the inter-channel network. (author). 37 refs., 79 figs., 39 tabs.
In this review article, we compare the performance of two computing systems: quantum computing and coherent computing. A layered architecture for circuit-model quantum computing, employing surface code quantum error correction, has been recently discussed. Using this concrete hardware platform, it is possible to provide resource analysis for executing fault-tolerant quantum computing for prime number factoring and molecular eigen-energy calculation that cannot be solved by present-day computing systems. A particular quantum computing system could solve such problems on the time scale of 1-10 days by using 10^8 to 10^9 physical qubits. We discuss an alternative computing system based on an injection-locked laser network, which is called a coherent computing system here. A three-dimens...
In this paper a scheme is presented for preventing violations of control signal constraints in a class of coupled systems. The scheme is an add-on solution to the existing control system; it works like a fault-tolerant scheme, accommodating the problem when it occurs. The proposed scheme recomputes the reference values to the system such that control signal constraint violations are avoided. The new reference values are found using an energy balance of the system. The scheme is intended to handle rarely occurring constraint violations, so the only concern is that the system remains stable, not that performance is optimized under all conditions. The scheme is applied to an example with a coal mill pulverizing coal for a power plant.
The Second Generation (Gen II) control system for the F-15 Intelligent Flight Control System (IFCS) program implements direct adaptive neural networks to demonstrate robust tolerance to faults and failures. The direct adaptive tracking controller integrates learning neural networks (NNs) with a dynamic inversion control law. The term direct adaptive is used because the error between the reference model and the aircraft response is being compensated or directly adapted to minimize error without regard to knowing the cause of the error. No parameter estimation is needed for this direct adaptive control system. In the Gen II design, the feedback errors are regulated with a proportional-plus-integral (PI) compensator. This basic compensator is augmented with an online NN that changes the system gains via an error-based adaptation law to improve aircraft performance at all times, including normal flight, system failures, mispredicted behavior, or changes in behavior resulting from damage.
Scalable management of distributed resources is one of the major challenges in deployment of large-scale clusters. Management includes transparent fault tolerance, efficient allocation of resources, and support for all the needs of parallel computing: parallel I/O, deterministic behavior, and responsiveness. Meeting these requirements with commodity hardware and operating systems is difficult because they were not designed to support global management of a large-scale system. In this paper we propose a small set of hardware mechanisms in the cluster interconnect to facilitate the implementation of a simple yet powerful global operating system. This system, inspired by concepts from the BSP and SIMD computational models, allows commodity clusters to grow to thousands of nodes while still retaining the usability and responsiveness of the single-node workstation. Our results on a software prototype show that it is possible to implement efficient and scalable system software using the proposed set of mechanisms.
The control mixer concept is efficient in turning an ordinary control system into a fault-tolerant one, especially for control systems whose control laws are very difficult to redesign on-line in real time. In order to consider the stability, performance and robustness of the reconfigured system simultaneously, and to deal with a more general controller reconfiguration than the static feedback mechanism of the control mixer approach, the robust control mixer module method is proposed in this paper. The form of the control mixer module is extended from a static gain matrix to an LTI dynamical system, and furthermore multiple dynamical control mixer modules can be employed. H_{\infty} control theory is used for the analysis and design of the robust control mixer modules. Finally, a practical robot arm system is used as a benchmark to test the proposed method.
Modern satellite-based experiments are often very complex real-time systems, composed of flight and ground segments, that have challenging resource-related constraints in terms of size, weight, power, requirements for real-time response, fault tolerance, and specialized input/output hardware and software, and they must be certified to high levels of assurance. Hardware-software data processing systems have to be responsive to system degradation and to changes in the data acquisition modes, and actions have to be taken to change the organization of the mission operations. A large research and development effort by a team of scientists and technologists can produce software systems able to optimize the hardware to reach very high levels of performance, or to drive degraded hardware so that it maintains satisfactory features. We show real-life examples describing a system that processes the data of an X-ray detector on a satellite-based mission in spinning mode.
In this paper, we present a new fault-tolerant, large-scale star network scheme called Scalable Autonomous Fault-tolerant Ethernet (SAFE). The primary goal of a SAFE scheme is to provide network scalability and autonomous fault detection and recovery. SAFE divides a large-scale, mission-critical network, such as the naval combatant network, into several subnets by limiting the number of nodes in each subnet. This network can be easily configured as a star network in order to meet fault recovery time requirements. For SAFE, we developed a novel mechanism for inter-subnet fault detection and recovery; a conventional Ethernet-based heartbeat mechanism is used in each subnet. Theoretical and experimental performance analyses of SAFE in terms of fail-over time were conducted under various network failure scenarios. The results validate our scheme.
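The heartbeat-based detection used inside each subnet can be sketched roughly as follows. This is a minimal illustration, not the SAFE protocol itself: the `HeartbeatMonitor` class, the 0.3 s timeout, and the node names are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Suspects a node when no heartbeat arrived within `timeout` seconds."""
    timeout: float
    last_seen: dict = field(default_factory=dict)

    def beat(self, node: str, now: float) -> None:
        # Record the arrival time of the latest heartbeat from `node`.
        self.last_seen[node] = now

    def failed_nodes(self, now: float) -> list:
        # A node is suspected if its last heartbeat is older than the timeout.
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)


# Hypothetical subnet with three nodes; node "b" stops sending heartbeats.
monitor = HeartbeatMonitor(timeout=0.3)
for node in ("a", "b", "c"):
    monitor.beat(node, now=0.0)
monitor.beat("a", now=0.25)
monitor.beat("c", now=0.25)
suspected = monitor.failed_nodes(now=0.4)
```

In this sketch the fail-over time is bounded below by the timeout, which is one reason limiting subnet size and tuning the heartbeat period matter for meeting recovery-time requirements.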
In this paper we present an approach to the synthesis of fault-tolerant schedules for embedded applications with soft and hard real-time constraints. We aim to guarantee the deadlines for the hard processes even in the case of faults, while maximizing the overall utility. We use time/utility functions to capture the utility of soft processes. Process re-execution is employed to recover from multiple faults. A single static schedule computed off-line is not fault-tolerant and is pessimistic in terms of utility, while a purely online approach, which computes a new schedule every time a process fails or completes, incurs an unacceptable overhead. Thus, we use a quasi-static scheduling strategy, where a set of schedules is synthesized off-line and, at run time, the scheduler selects the right schedule based on the occurrence of faults and the actual execution times of processes. The proposed schedule synthesis heuristics have been evaluated using extensive experiments.
The role of pull aparts and pushups in transcurrent systems, the rotation of faults and blocks within transcurrent fault systems, the role of accretion tectonics in plate boundary deformation, and power law creep behavior and the yielding at plate boundar...
...necessary to recover normal performance on this engine...no direct fuel system fault has been identified...this will not affect normal standby pump operation...
Parallel converters in a wind turbine give a number of advantages, such as fault tolerance due to the redundant converters. However, it might be difficult to isolate gain faults in one of the converters if only a combined power measurement is available. In this paper a scheme using orthogonal power references to the converters is proposed. Simulations on a wind turbine with 5 parallel converters show a clear potential of this scheme for isolating a gain fault to the converter in which it occurs.
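The run-time side of a quasi-static strategy can be sketched as a table lookup. The example below is a simplified assumption, not the authors' synthesis heuristic: the `SCHEDULES` table, the process names (`P*` hard, `S*` soft), and selection by fault count alone are hypothetical, and the dependence on actual execution times is left out.

```python
# Hypothetical table synthesized off-line: faults observed so far -> schedule.
# Hard processes (P1, P2) are always kept; soft processes (S1, S2) are dropped
# as re-executions consume slack.
SCHEDULES = {
    0: ["P1", "P2", "S1", "S2"],  # no faults: run every soft process
    1: ["P1", "P2", "S1"],        # one re-execution: drop lowest-utility soft
    2: ["P1", "P2"],              # two re-executions: hard processes only
}


def select_schedule(faults_observed, table=SCHEDULES):
    """Pick the precomputed schedule for the observed number of faults.

    Fault counts beyond the largest precomputed entry fall back to the most
    conservative schedule in the table.
    """
    key = min(faults_observed, max(table))
    return table[key]


normal = select_schedule(0)
degraded = select_schedule(5)
```

The lookup costs almost nothing at run time, which is the point of the approach: the expensive search happens off-line, and the scheduler only switches between schedules it already holds.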
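Why orthogonal references help can be seen in a small numerical sketch. It is an illustration under stated assumptions rather than the scheme from the paper: each converter is modeled as a pure gain, the references are a constant plus a zero-mean ±1 pattern taken from a Hadamard matrix, and the names `hadamard`, `estimate_gains`, `base`, and `amplitude` are hypothetical.

```python
def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def estimate_gains(measured, patterns, amplitude):
    """Project the combined measurement onto each converter's pattern.

    Because the patterns are zero-mean and mutually orthogonal, the projection
    onto pattern i isolates gain_i * amplitude * T.
    """
    T = len(measured)
    return [sum(y * w for y, w in zip(measured, p)) / (amplitude * T)
            for p in patterns]


base, amplitude = 1.0, 0.1           # assumed reference level and perturbation
patterns = hadamard(8)[1:6]          # rows 1..5: zero-mean, mutually orthogonal
true_gains = [1.0, 1.0, 0.8, 1.0, 1.0]  # converter 2 has a gain fault

# Only the sum of the five converter outputs is measured.
measured = [sum(g * (base + amplitude * p[t])
                for g, p in zip(true_gains, patterns))
            for t in range(8)]

estimated = estimate_gains(measured, patterns, amplitude)
faulty = min(range(5), key=lambda i: estimated[i])
```

With identical references, the combined measurement would only reveal the sum of the gains; the orthogonal perturbations are what make the individual gains separable.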
The fast Fourier transform (FFT) has long been a major analytical tool in such diverse fields as system analysis, digital filtering, power spectrum analysis, and communication theory. With the advent of VLSI technology, large collections of processing elements, which cooperate with each other to achieve high-speed computation, have become economically feasible. Since any functional error in a high-performance system may seriously jeopardize the operation of the system and its data integrity, some level of fault tolerance must be incorporated in order to ensure that the results of long computations are valid. In this paper, two concurrent error detection (CED) schemes are proposed for N-point FFT networks, which consist of log_2 N stages with N/2 two-point butterfly modules in each stage. The method assumes that failures are confined to a single complex multiplier or adder or one input or output set of lines. Such a fault model covers a broad class of faults. It is shown that only a small hardware overhead ratio, O(2/log_2 N), is required for the FFT networks to obtain fault-secure results in the first scheme. A new data retry technique is used to locate the faulty modules. Large roundoff errors can also be detected and treated in the same manner as functional errors. However, the data retry technique can also distinguish between roundoff errors and functional errors caused by physical failures. In the second scheme, a time redundancy method is used to achieve both error detection and location. It is shown that only negligible hardware overhead is required. However, the throughput is reduced to half that of the original system without error detection and location, due to the nature of time redundancy methods.
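As a rough illustration of concurrent error detection on an FFT, the sketch below checks two checksum identities of the DFT: X[0] equals the sum of the inputs, and the mean of the outputs equals x[0]. This is a generic checksum check chosen for the example, not either of the two schemes proposed in the paper; the function names and the tolerance `tol` are assumptions.

```python
import cmath


def fft(x):
    """Recursive radix-2 decimation-in-time FFT (len(x) must be a power of two)."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t            # butterfly upper output
        out[k + n // 2] = even[k] - t   # butterfly lower output
    return out


def checks_pass(x, X, tol=1e-9):
    """Check X[0] == sum(x) and mean(X) == x[0], up to a roundoff tolerance."""
    n = len(x)
    return abs(X[0] - sum(x)) <= tol and abs(sum(X) / n - x[0]) <= tol


x = [1, 2, 3, 4, 0, -1, 2, 5]
X = fft(x)
ok_before = checks_pass(x, X)

# Inject a functional error into one output line.
X_faulty = list(X)
X_faulty[3] += 0.5
ok_after = checks_pass(x, X_faulty)
```

The tolerance plays the role of separating roundoff from functional errors; a check this simple can miss errors that happen to cancel in the sum, so it detects only a subset of faults.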
Because of its high efficiency, robustness, and high power density, a permanent magnet synchronous machine (PMSM) is a desirable choice for high-performance applications, such as naval shipboard power systems. The stator winding fault is the most common electrical fault in PMSM; thus, detection of this type of fault is very important. The objective of this paper is to model and detect the location and severity of the stator winding fault of PMSM. To achieve this objective, a mathematical model that can describe both healthy and fault conditions is developed. Simulation results match the observations of this type of fault in the literature. According to the fault model, two parameters associated with fault location and fault severity must be identified in order to detect the fault. Because ...
The fault model of the nonlinear system and the generalized nonlinear observer are presented in this paper. The algorithm compensates the influence of faults on the remnant order by adjusting the gain matrix of the observer to track the fault parameters of the system on line. It is proved that the nonlinear observer can realize fault diagnosis. In effect, the problem of fault diagnosis is converted into a problem of model matching. The fault estimate can be exact and timely. Finally, experiments on a three-tank system illustrate the effectiveness of the proposed algorithm.
This paper is concerned with the application of fuzzy neural networks to fault diagnosis systems for rotary machines. In practical fault diagnosis, it is very difficult to improve the recognition rate of pattern recognition, especially when the sample data are similar. To solve these difficulties, a fault diagnosis system using fuzzy neural networks is proposed in this research. A fault diagnosis system with fuzzy neural networks is based on a series of standard fault pattern pairings between fault symptoms and fault. Fuzzy neural networks are trained to memorize these standard pattern pairs. Unlike other neural networks, fuzzy neural networks adopt bi-directional association. They make use of information from both the fault symptoms and the fault patterns, which can improve recognition rate greatly. When an unknown sample becomes the input for a trained fault diagnosis system, the fault diagnosis system can make fault diagnosis by bi-directional association of fuzzy neural networks. Through experiments with a rotor testing table and applications in monitoring and fault diagnosis of water pump sets of oil plant, it is verified that fuzzy neural networks have a well distinguished ability and are effective to perform fault diagnosis of rotary machines.
Most of the large earthquakes (magnitude greater than 8.0) observed in Japan fall into the subduction plate-boundary category. Based on the results of previous Nankai Trough research efforts, further research opportunities have been proposed under the umbrella of the IODP scientific drilling proposal 603 (NanTroSEIZE: Nankai Trough Seismogenic Zone Experiment) ranked as the top level proposal in IODP. IODP proposal 603 not only proposes drilling, coring and geological analysis, and geophysical logging, but also mandates that a long-term borehole monitoring system be installed into two deep riser holes at about 3,500 m and about 6,000 m below sea floor (mbsf), where we expect to encounter the mega-splay and the locked region of mega thrust fault, respectively. The first riser target (NT2-03 site) is expected to drill through five potential splay faults above 3,500 mbsf. We plan to install sensors to monitor strain, tilt and optionally pore pressure for crustal deformation at and between splay faults, to monitor seismometer array for micro and slow earthquakes detection and for seismic microstructures, and to monitor pore pressure and temperature for hydrologic state change at the fault during interseismic period. The major technical features to develop the deep ocean borehole observatory for NT2-03A are mainly as follows: 1) high temperature (125 °C), 2) long life (5 years), 3) deployment (15,000 psi wellhead system, deep well, retrieval, perforation, packer, mechanical shock), 4) coupling to formation (cement, clamp), 5) multi level monitoring (against 5 splay faults), 6) multi purpose monitoring (seismic, geodetic, hydrogeologic), 7) low power consumption, 8) real time monitoring (connecting to sea bed cable), 9) accurate synchronization, 10) wide frequency range / high dynamic range ADC, 11) down sizing (installing into 9-5/8" casing with tubing), 12) system redundancy (fault-tolerant).
We started to develop an experimental prototype (EXP) for field test using the borehole on land from 2007, and plan to carry out the field test in 2009. The objectives of EXP are to collect real data to determine if the system works as designed and to identify all unseen problems in the design. The implementation phase will be followed to improve the system based on the field test results, upgrade the design, and produce the Engineering Prototype (ENP) to be installed in NT2-03A in 2011.
Modern High-Performance Computing (HPC) centers are facing a data deluge from emerging scientific applications. Supporting large data entails a significant commitment of the high-throughput center storage system, scratch space. However, the scratch space is typically managed using simple purge policies, without sophisticated end-user data services to balance resource consumption and user serviceability. End-user data services such as offloading are performed using point-to-point transfers that are unable to reconcile the center's purge and users' delivery deadlines, unable to adapt to changing dynamics in the end-to-end data path, and are not fault-tolerant. Such inefficiencies can be prohibitive to sustaining high performance. In this paper, we address the above issues by designing a framework for the timely, decentralized offload of application result data. Our framework uses an overlay of user-specified intermediate and landmark sites to orchestrate a decentralized fault-tolerant delivery. We have implemented our techniques within a production job scheduler (PBS) and data transfer tool (BitTorrent). Our evaluation using both a real implementation and supercomputer job log-driven simulations shows that the offloading times can be significantly reduced (90.4% for a 5 GB data transfer) and that the exposure window can be minimized while also meeting center-user Service Level Agreements.
Insects are the most successful group of living things in terms of the number of species, the biomass and their distribution. Entomological research has revealed that the insect sensory systems are crucial for their success. Compared to human brains, the insect central nerve systems are extremely primitive and simple, both structurally and functionally, and are of minimal learning ability. Faced with these constraints, insects have evolved a set of extremely effective sensory systems that are structurally simple, functionally versatile and powerful, and highly distributed, as well as noise and fault tolerant. As a result, in recent years insect sensory systems have been inspirational to new communications and computing paradigms, which have led to significant advances. However, we believe...
Reliable systems have always been built out of unreliable components. Early on, the reliable components were small such as mirrored disks or ECC (Error Correcting Codes) in core memory. These systems were designed such that failures of these small components were transparent to the application. Later, the size of the unreliable components grew larger and semantic challenges crept into the application when failures occurred. As the granularity of the unreliable component grows, the latency to communicate with a backup becomes unpalatable. This leads to a more relaxed model for fault tolerance. The primary system will acknowledge the work request and its actions without waiting to ensure that the backup is notified of the work. This improves the responsiveness of the system. There are two implications of asynchronous state capture: 1) Everything promised by the primary is probabilistic. There is always a chance that an untimely failure shortly after the promise results in a backup proceeding without knowledge o...
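The first implication can be made concrete with a toy model of asynchronous state capture. The `Primary` and `Backup` classes below are hypothetical and single-threaded; they only illustrate how an acknowledged request can be unknown to the backup if the primary fails before its pending queue is shipped.

```python
from collections import deque


class Backup:
    def __init__(self):
        self.log = []

    def apply(self, item):
        self.log.append(item)


class Primary:
    def __init__(self, backup):
        self.backup = backup
        self.log = []
        self.pending = deque()  # work acknowledged but not yet sent to backup

    def submit(self, item):
        # Acknowledge immediately; replication happens later, in flush().
        self.log.append(item)
        self.pending.append(item)
        return "ack"

    def flush(self):
        while self.pending:
            self.backup.apply(self.pending.popleft())


backup = Backup()
primary = Primary(backup)
primary.submit("w1")
primary.flush()               # w1 reaches the backup
reply = primary.submit("w2")  # acknowledged to the client...
# ...and the primary is assumed to crash here, before flush() runs again.
lost = [w for w in primary.log if w not in backup.log]
```

A synchronous design would call `flush()` before returning the acknowledgement, trading the lost-work window for the backup's round-trip latency on every request.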
A new tree-structured-architecture (TSA) is proposed. The TSA is object-oriented, implements the notions of capability-based-addressing and the single-level-store, and it is particularly designed to narrow the semantic gap. It encourages modular programming and directly supports the concepts of tasks and intertask communication, making it particularly suitable for multiprocessing and multiprogramming implementation. The TSA is implemented on a multiresource, distributed, matrix-structured, reconfigurable, fault-tolerant computing system, proposed in this article. The system implements some ideas of the dynamic and the flexible architecture. The system possesses a high degree of regularity by virtue of implementing large amounts of identical units. The hardware is modular, permitting the addition or removal of groups of units without a change in the system's software. 8 references.
Computational concurrency has been with us for some time and is here to stay, particularly in the domain of distributed systems and fault-tolerant computers. Processes executing concurrently in such systems communicate in order to exchange information and to synchronise the activities which they perform. Classical interprocess synchronisation mechanisms, based on shared variables and semaphores, are neither efficient nor methodically sound; nor do they produce correct solutions when dealing with autonomous processes running on different processors. The purpose of this paper is to define the problem, submit the arguments against using shared variables and semaphores for interprocess synchronisation in distributed systems, and to present a mechanism for handling interprocess interaction in such systems. 15 references.
The design of a fault-tolerant distributed, real-time, embedded system with safety-critical concerns requires the use of formal languages. In this paper, we present the foundations of a new software engineering method for real-time systems that enables the integration of semiformal and formal notations. This new software engineering method is mostly based upon the "COntinuuM" co-modeling methodology that we have used to integrate architecture models of real-time systems (Perseil and Pautet in 12th International conference on engineering of complex computer systems, ICECCS, IEEE Computer Society, Auckland, pp 371-376, 2007) (so we call it "Method C"), and a model-driven development process (ISBN 978-0-387-39361-2 in: From model-driven design to resource management for distributed embedde...
We examine a model of classical deterministic computing in which the ground state of the classical system is a spatial history of the computation. This model is relevant to quantum dot cellular automata as well as to recent universal adiabatic quantum computing constructions. In its most primitive form, systems constructed in this model cannot compute in an error free manner when working at non-zero temperature. However, by exploiting a mapping between the partition function for this model and probabilistic classical circuits we are able to show that it is possible to make this model effectively error free. We achieve this by using techniques in fault-tolerant classical computing and the result is that the system can compute effectively error free if the temperature is below a critical temperature. We further link this model to computational complexity and show that a certain problem concerning finite temperature classical spin systems is complete for the complexity class Merlin-Arthur. This provides an inter...
Highly nonlinear behavior of a system of discrete sites on a lattice is observed when a specific feedback loop is introduced into models employing coupled map lattices, quantum cellular automata, or the real-valued analogues of the latter. It is shown that the combination of two operations, i.e. i) enhancement of a site's value when fulfilling a feedback condition and ii) normalization of the system after each time step, produces relatively short-lived spatio-temporal patterns whose mean lifetime can be considered as an emergent order parameter of the system. This mean lifetime obeys a scaling law involving a control parameter which tunes the "fault tolerance" of the feedback condition. Thus, within appropriate ranges of the system's variables, the dynamical properties can be characterized by a "fractal evolution dimension" (as opposed to a "fractal dimension").
NASA is pursuing a program in Advanced Subsonic Transport (AST) to develop the technology for a highly reliable Fly-By-Light/Power-By-Wire aircraft. One of the primary objectives of the program is to develop the technology base for confident application of integrated PBW components and systems to transport aircraft to improve operating reliability and efficiency. Technology will be developed so that the present hydraulic and pneumatic systems of the aircraft can be systematically eliminated and replaced by electrical systems. These motor driven actuators would move the aircraft wing surfaces as well as the rudder to provide steering controls for the pilot. Existing aircraft electrical systems are not flight critical and are prone to failure due to Electromagnetic Interference (EMI) (1), ground faults and component failures. In order to successfully implement electromechanical flight control actuation, a Power Management and Distribution (PMAD) System must be designed having a reliability of 1 failure in 10^9 hours, EMI hardening and a fault-tolerant architecture to ensure uninterrupted power to all aircraft flight critical systems. The focus of this paper is to analyze, define, and describe technically challenging areas associated with the development of a Power By Wire Aircraft and typical requirements to be established at the box level. The authors will attempt to propose areas of investigation, citing specific military standards and requirements that need to be revised to accommodate the 'More Electric Aircraft Systems'.
Formal robustness analysis of aircraft control upset prevention and recovery systems could play an important role in their validation and ultimate certification. Such systems developed for failure detection, identification, and reconfiguration, as well as upset recovery, need to be evaluated over broad regions of the flight envelope or under extreme flight conditions, and should include various sources of uncertainty. To apply formal robustness analysis, formulation of linear fractional transformation (LFT) models of complex parameter-dependent systems is required, which represent system uncertainty due to parameter uncertainty and actuator faults. This paper describes a detailed LFT model formulation procedure from the nonlinear model of a transport aircraft by using a preliminary LFT modeling software tool developed at the NASA Langley Research Center, which utilizes a matrix-based computational approach. The closed-loop system is evaluated over the entire flight envelope based on the generated LFT model which can cover nonlinear dynamics. The robustness analysis results of the closed-loop fault-tolerant control system of a transport aircraft are presented. A reliable flight envelope (safe flight regime) is also calculated from the robust performance analysis results, over which the closed-loop system can achieve the desired performance of command tracking and failure detection.
The purpose of this project is to explore the feasibility of automating the verification process for computer systems. The intent is to demonstrate that both the software and hardware that comprise the system meet specified availability and reliability criteria, that is, total design analysis. The approach to automation is based upon the use of Automated Reasoning Software developed at Argonne National Laboratory. This approach is herein referred to as formal analysis and is based on previous work on the formal verification of digital hardware designs. Formal analysis represents a rigorous evaluation which is appropriate for system acceptance in critical applications, such as a Reactor Safety System (RSS). This report describes a formal analysis technique in the context of a case study, that is, demonstrates the feasibility of applying formal analysis via application. The case study described is based on the Reactor Safety System (RSS) for the Experimental Breeder Reactor-II (EBR-II). This is a system where high reliability and availability are tantamount to safety. The conceptual design for this case study incorporates a Fault-Tolerant Processor (FTP) for the computer environment. An FTP is a computer which has the ability to produce correct results even in the presence of any single fault. This technology was selected as it provides a computer-based equivalent to the traditional analog based RSSs. This provides a more conservative design constraint than that imposed by the IEEE Standard, Criteria For Protection Systems For Nuclear Power Generating Stations (ANSI N42.7-1972).
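One common way to produce correct results in the presence of any single fault is triple modular redundancy with majority voting. Whether the FTP in this case study works this way is not stated in the abstract, so the sketch below is an assumption used purely to illustrate single-fault masking; the `vote` function is hypothetical.

```python
from collections import Counter


def vote(outputs):
    """Return the strict-majority value among replica outputs.

    With three replicas, any single faulty output is outvoted by the two
    correct ones. Raises ValueError when no strict majority exists.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replica outputs")
    return value


# Three replicas compute the same result; the third one is faulty.
masked = vote([42, 42, 7])

# Two simultaneous faults exceed what three replicas can mask.
try:
    vote([1, 2, 3])
    detected_unmaskable = False
except ValueError:
    detected_unmaskable = True
```

Formal analysis of such a design would need to cover the voter itself, since in this simple form it is a single point of failure.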
Three data-base architectures may be distinguished among Picture Archiving and Communication Systems (PACSs): (1) Configurations with logically and physically centralized data-base and file server, (2) systems with physically distributed file servers and a logically centralized data-base, and (3) installations with logically and physically distributed data-bases and file servers. A brief overview of these architectures and their scaleability, performance, and fault-tolerance is given. A PACS for an existing large university hospital is designed for the first as well as the second architecture using given image production data and workflow. We evaluate the fault-tolerance of the two architectures. By modeling the work-flow and employing queuing theory, solutions with practically realizable data transfer requirements are found for both architectures. With today's performance and cost of computers, storage, and information management technologies, the second and third architectures are preferably implemented, depending on the size of the installation. The architectures offer almost unlimited scaleability, very high fault-tolerance, and optimized workflow. We describe a modern commercial PACS that adheres to the open-systems concept and consists of software application programs that run, independent of specific computer and network components, on off-the-shelf hardware and under standard multi-platform operating systems and utilize commercial data-base management systems and network managers. The system is based on the second architecture with multiple islands of functionality, each with servers and archive modules and a physically distributed data-base. Our PACS architecture supports browser technology: Workstations use the data-base to determine the location of needed information and then, through the image browser, mount the appropriate file server for access.
The architecture supports a concept similar to domain name server (DNS) directory services on the Internet. The system can be expanded to enterprise-wide installations with a logically distributed data-base. Openness, scaleability, and longevity of a PACS also strongly depend on the architecture of software applications in the operating and tool-set environment as well as on the distribution of image processing tasks across a PACS. These issues are discussed in the last section of our paper. We are presenting an image processing strategy that provides a consistent rendering of image gray-scale and spatial resolution throughout the entire PACS.
The traditional urban public transport system generally cannot provide an effective access service for people with disabilities, especially for disabled, wheelchair and blind (DWB) passengers. In this paper, based on advanced information & communication technologies (ICT) and green technologies (GT) concepts, a dedicated public urban transportation service access system named Mobi+ has been introduced, which facilitates the mobility of DWB passengers. The Mobi+ project consists of three subsystems: a wireless communication subsystem, which provides the data exchange and network connection services between buses and stations in the complex urban environments; the bus subsystem, which provides the DWB class detection & bus arrival notification services; and the station subsystem, which implements the urban environmental surveillance & bus auxiliary access services. The Mobi+ card that supports multi-microcontroller multi-transceiver adopts the fault-tolerant component-based hardware architecture, in which the dedicated embedded system software, i.e., operating system micro-kernel and wireless protocol, has been integrated. The dedicated Mobi+ embedded system provides the fault-tolerant resource awareness communication and scheduling mechanism to ensure the reliability in data exchange and service provision. At present, the Mobi+ system has been implemented on the buses and stations of line '2' in the city of Clermont-Ferrand (France). The experimental results show that, on one hand, the Mobi+ prototype system reaches the design expectations and provides an effective urban bus access service for people with disabilities; on the other hand, the Mobi+ system is easy to deploy in the buses and at bus stations thanks to its low energy consumption and small form factor. PMID:23112622; PMID:22319358
In Japan, the empirical formula proposed by Matsuda (1975), based mainly on the length of historical surface fault ruptures and magnitude, is generally applied to estimate the size of future earthquakes from the extent of existing active faults for seismic hazard assessment. The validity of the active fault length and the definition of individual segment boundaries where propagating ruptures terminate are therefore crucial to the reliability of these assessments. It is, however, unlikely that we can clearly identify the behavioral earthquake segments from observation of surface faulting during the historical period, because most active faults in Japan have recurrence intervals longer than 1000 years. In addition, the uncertainties of the datasets obtained mainly from fault trenching studies are quite large for fault grouping/segmentation. This is why new methods or criteria should be applied for active fault grouping/segmentation, and one candidate is a geometric criterion for active faults. Matsuda (1990) used "five kilometers" as a critical distance for grouping and separation of neighboring active faults. On the other hand, Nakata and Goto (1998) proposed geometric criteria such as (1) branching features of active fault traces and (2) the characteristic pattern of vertical-slip distribution along the fault traces as tools to predict the rupture length of future earthquakes. Branching during fault rupture propagation is regarded as an effective energy dissipation process and could result in final rupture termination. With respect to the characteristic pattern of vertical-slip distribution, especially with strike-slip components, the up-thrown sides along the faults are, in general, located on the fault blocks in the direction of relative strike-slip. By applying these new geometric criteria to high-resolution active fault distribution maps, fault grouping/segmentation could be conducted more practically.
We tested this model successfully on the active faults that generated the 1943 Tottori earthquake, the Chojagahara-Yoshii fault zone in Chugoku district in southwest Japan, as well as the active fault system in northern Luzon, the Philippines. Thus, we name this conceptual model the "Packaged Fault Model" and call the active faults grouped by the model "Packaged Faults" for individual earthquake source faults. Moreover, we have found that active fault mapping with the "Packaged Fault Model" in mind enables us to find many new active fault traces (e.g., the Shigenobu fault along the MTL in Japan).
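The length-to-magnitude step described in this abstract can be illustrated with a short sketch. It assumes the commonly quoted form of the Matsuda (1975) relation, log10(L) = 0.6 M - 2.9 with L in kilometers, and a simplified reading of the 5 km grouping distance in which consecutive segments separated by less than 5 km are merged and their lengths summed. The function names and the sample lengths are hypothetical and are not taken from the paper.

```python
import math


def matsuda_magnitude(length_km):
    """Magnitude M from rupture length L (km), inverting log10(L) = 0.6*M - 2.9."""
    return (math.log10(length_km) + 2.9) / 0.6


def group_segments(lengths_km, gaps_km, max_gap_km=5.0):
    """Merge consecutive segments whose separation is below max_gap_km.

    lengths_km: segment lengths in along-strike order.
    gaps_km:    gaps_km[i] is the separation between segment i and segment i+1.
    Returns the summed length of each group (gap widths are ignored here,
    which is an assumption made for simplicity).
    """
    groups = [lengths_km[0]]
    for length, gap in zip(lengths_km[1:], gaps_km):
        if gap < max_gap_km:
            groups[-1] += length   # close neighbor: same earthquake source
        else:
            groups.append(length)  # distant neighbor: start a new group
    return groups


# Hypothetical example: three mapped segments with 2 km and 7 km gaps.
grouped = group_segments([12.0, 8.0, 15.0], [2.0, 7.0])
magnitudes = [round(matsuda_magnitude(length), 2) for length in grouped]
```

Grouping the first two segments raises the estimated magnitude of that source relative to treating each segment separately, which is why the grouping criterion matters for hazard assessment.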
We present a fault-tolerant control strategy based on a new principle for actuator fault diagnosis. The scheme employs a standard bank of observers which match the different fault situations that can occur in the plant. Each of these observers has an associated estimation error with distinctive dynamics when an estimator matches the current fault situation of the plant. Based on the information from each observer, a fault detection and isolation (FDI) module is able to reconfigure the control loop by selecting the appropriate control law from a bank of controllers, each of them designed to stabilise and achieve reference tracking for one of the given fault models. The main contribution of this article is to propose a new FDI principle which exploits the separation of sets that characterise...
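The observer-bank idea can be sketched as follows. This is a simplified stand-in, not the set-separation principle the article proposes: it picks the fault model whose observer output is closest to the measurement and then looks up the matching controller. All names, labels and numbers are hypothetical.

```python
def select_fault_mode(measurement, observer_estimates):
    """Return the label of the observer whose estimate best matches the measurement.

    measurement:        list of measured outputs.
    observer_estimates: dict mapping a fault-mode label to that observer's
                        predicted outputs (same length as measurement).
    """
    def squared_residual(label):
        estimate = observer_estimates[label]
        return sum((m - e) ** 2 for m, e in zip(measurement, estimate))

    return min(observer_estimates, key=squared_residual)


def reconfigure(measurement, observer_estimates, controller_bank):
    """Pick the controller designed for the fault mode that matches best."""
    mode = select_fault_mode(measurement, observer_estimates)
    return mode, controller_bank[mode]


# Hypothetical bank: one observer and one controller gain per fault model.
estimates = {"nominal": [1.5, 2.5], "actuator_1_loss": [1.02, 1.97]}
controllers = {"nominal": 0.8, "actuator_1_loss": 1.6}
mode, gain = reconfigure([1.0, 2.0], estimates, controllers)
```

A minimum-residual rule like this one gives no guarantee against misclassification under disturbances, which is the kind of gap a set-based decision criterion is meant to close.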
In this paper, we analyze fault-tolerance properties of the Majority Gate, as the main logic gate for implementation with Quantum-dot Cellular Automata (QCA), in terms of fabrication defects. Our results demonstrate the poor fault-tolerance properties of the conventional design of the Majority Gate and thus the difficulty in its practical application. We propose a new approach to the design of a QCA-based Majority Gate by considering two-dimensional arrays of QCA cells rather than a single cell for the design of such a gate. We analyze fault-tolerance properties of such Block Majority Gates in terms of input misalignment and irregularity and defects (missing cells) in assembly of the array. We present simulation results based on semiconductor implementation of QCA with an intermediate dimensional dot of about 5 nm in size as opposed to magnetic dots of greater than 100 nm or molecular dots of 2-5A. Our results clearly demonstrate the superior fault-tolerance properties of the Block Majority Gate and its greater potential for a practical realization. We also show the possibility of designing fault-tolerant QCA circuits by using Block Majority Gates.
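To make the role of the Majority Gate concrete, the sketch below models only its Boolean behavior, not the QCA cell physics or the defect analysis in the paper. It shows the standard identity that fixing one input to 0 or 1 turns a three-input majority gate into AND or OR. The function names are illustrative.

```python
def majority(a, b, c):
    """Three-input majority: 1 when at least two inputs are 1."""
    return (a & b) | (b & c) | (a & c)


def and_gate(a, b):
    """AND obtained by tying the third majority input to 0."""
    return majority(a, b, 0)


def or_gate(a, b):
    """OR obtained by tying the third majority input to 1."""
    return majority(a, b, 1)


# Truth table as (a, b, AND, OR) rows.
truth_table = [(a, b, and_gate(a, b), or_gate(a, b))
               for a in (0, 1) for b in (0, 1)]
```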
The authors address the requirements, benefits, and mitigation of risks to adapt a commercial Hexad fault-tolerant inertial navigation/global positioning system (FT IN/GPS) for use in next-generation spacecraft. Next-generation requirements are examined to determine whether a high production base system can meet autonomous, reliable, and low-cost requirements for future spacecraft. The major benefits are the combining and replacement of functions, the reduction of unscheduled maintenance and operations costs, and a higher probability of mission success. The design, development, and production risks are mitigated by the long-term commercial production schedule for the Boeing 777 air data inertial reference unit (ADIRU) which begins in the mid-1990s. The conclusion is that a strapdown ring laser gyro (RLG) Hexad FT IN/GPS is the preferred integrated navigation and control system for next-generation vehicles.
Under a NASA-sponsored technology development project, a multi-disciplinary team consisting of industry, academia, and government organizations led by Hamilton Sundstrand is developing an amine-based humidity and carbon dioxide (CO2) removal process and prototype equipment for Vision for Space Exploration (VSE) applications. This system employs thermally linked amine sorbent beds operating as a pressure swing adsorption system, using the vacuum of space for regeneration. The prototype hardware was designed based on a two-fault-tolerant requirement, resulting in a single system that could handle the metabolic water and carbon dioxide load for a crew size of six. Two full-scale prototype hardware sets, consisting of a linear spool valve, actuator and amine sorbent canister, have been manufactured, tested, and subsequently delivered to NASA JSC. This paper presents a summary of the hardware configuration, operation, and performance in addition to current follow-on activity including a new valve design and trace contaminant removal evaluation.
This paper describes an ongoing R&D project to develop, design, install, and assess the field performance of an advanced substation alarm system. SAMS provides a highly fault-tolerant system for the reporting of equipment alarms. SAMS separates and identifies each of the multiple alarm contacts, transmits an alarm condition over the existing substation two-wire system, and displays the alarm source, and its associated technical information, on a touch-screen monitor inside the substation control room, at a remote central location, and on a hand-held terminal which may be carried anywhere within the substation. SAMS is currently installed at the Sherman Creek substation in the Bronx for the purpose of a three-month field evaluation.
System-on-a-chip or system on chip (SoC or SOC) refers to integrating all components of a computer or other electronic system into a single integrated circuit (chip). It may contain digital, analog, mixed-signal, and often radio-frequency functions all on a single chip substrate. Complexity drives it all: Radiation tolerance and testability are challenges for fault isolation, propagation, and validation. Bigger single silicon die than flown before and technology is scaling below 90nm (new qual methods). Packages have changed and are bigger and more difficult to inspect, test, and understand. Add in embedded passives. Material interfaces are more complex (underfills, processing). New rules for board layouts. Mechanical and thermal designs, etc.
We discuss the potential and limitations of Gaussian cluster states for measurement-based quantum computing. Using a framework of Gaussian-projected entangled pair states, we show that no matter what Gaussian local measurements are performed on systems distributed on a general graph, transport and processing of quantum information are not possible beyond a certain influence region, except for exponentially suppressed corrections. We also demonstrate that even under arbitrary non-Gaussian local measurements, slabs of Gaussian cluster states of a finite width cannot carry logical quantum information, even if sophisticated encodings of qubits in continuous-variable systems are allowed for. This is proven by suitably contracting tensor networks representing infinite-dimensional quantum systems. The result can be seen as sharpening the requirements for quantum error correction and fault tolerance for Gaussian cluster states and points toward the necessity of non-Gaussian resource states for measurement-based quantum computing. The results can equally be viewed as referring to Gaussian quantum repeater networks.
When a High Performance Storage System's mover shuttles large amounts of data to storage over a single Ethernet device, that single channel can rapidly become saturated. Using Linux Ethernet channel bonding to address this and similar situations was not, until now, a viable solution. The various modes in which channel bonding could be configured always offered some benefit, but only under strict conditions or at a system resource cost that was greater than the benefit gained by using channel bonding. Newer bonding modes designed by various networking hardware companies, helpful in such networking scenarios, were already present in their own switches. However, Linux-based systems were unable to take advantage of those new modes as they had not yet been implemented in the Linux kernel bonding driver. So, except for basic fault tolerance, Linux channel bonding could not positively combine separate Ethernet devices to provide the necessary bandwidth.
Automated Fingerprint Identification has a history of more than 20 years. In the last 5 years, there has been an explosion of technologies that have dramatically changed the face of AFIS. Few other engineering and science fields offer such a widespread use of technology as does computerized fingerprint recognition. Optics, computer vision, computer graphics, artificial intelligence, artificial neural networks, parallel processing, distributed client-server applications, fault-tolerant computing, scalable architectures, local and wide area networking, mass storage, and databases are a few of the fields that have made quantum leaps in recent years. All of these improvements have a dramatic effect on the size, speed, and accuracy of automated fingerprint identification systems. This paper offers a historical overview of these trends and discusses the state of the art. It culminates with an educated forecast on future systems, especially those 'real time' systems for use in the areas of law enforcement and civil/commercial applications.
The Beam Dumping System of the Large Hadron Collider, presently under construction at CERN, must function with utmost reliability to protect the personnel, minimize the risk of severe damage to the machine and avoid undue impact to the environment. The dumping action must be synchronized with the particle-free gap, and the field of the extraction and dilution elements must be well adjusted to the beam energy. The measures taken to arrive at a reliable and safe system will be described, such as the adoption of fault-tolerant design principles and other safety-related features including comprehensive monitoring, diagnostics and protection facilities. These issues will be discussed in the general framework of the IEC standard recommendations for safety-critical systems. Some examples related to the most critical functions will be included.
A document discusses placing memory modules on the high-speed serial interconnect, which is used by a spacecraft's computer elements for inter-processor communications, to allow all multiple computer system architectures to access the spacecraft data storage at the same time. Each memory board is identical electrically and receives its bus ID upon connection to the system. The computer elements are configured in a similar fashion. The architecture allows for multiple memory boards to be accessed simultaneously by different computer elements, and results in a scalable, strong, fault-tolerant system. The IEEE-1393 ring bus can be routed so that multiple card failures can occur and the mass memory storage will still function.
Distributed data storage systems are essential to deal with the need to store massive volumes of data. In order to make such a system fault-tolerant, some form of redundancy becomes crucial, incurring various overheads, most prominently in terms of storage space and maintenance bandwidth requirements. Erasure codes, originally designed for communication over lossy channels, provide a storage-efficient alternative to replication-based redundancy, but entail high communication overhead for maintenance when some of the encoded fragments need to be replenished with new ones after the failure of some storage devices. We propose as an alternative a new family of erasure codes called self-repairing codes (SRC) taking into account the peculiarities of distributed storage systems, specifically the maintenance process. SRC has the following salient features: (a) encoded fragments can be repaired directly from other subsets of encoded fragments by downloading less data than the size of the complete object, ensuring ...
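The maintenance overhead mentioned here can be illustrated with the simplest possible erasure code, a single XOR parity fragment. This is not the self-repairing code construction from the paper; it is a baseline sketch showing that a classical repair has to read every surviving fragment, which is the cost SRC aims to reduce. Function names and the sample data are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(fragments):
    """Append one parity fragment equal to the XOR of all data fragments."""
    parity = reduce(xor_bytes, fragments)
    return list(fragments) + [parity]


def repair(stored, lost_index):
    """Rebuild the fragment at lost_index from ALL the other stored fragments."""
    survivors = [f for i, f in enumerate(stored) if i != lost_index]
    return reduce(xor_bytes, survivors)


# Three data fragments plus one parity fragment tolerate any single loss.
stored = encode([b"abcd", b"efgh", b"ijkl"])
rebuilt = repair(stored, 1)
```

Repairing one 4-byte fragment here downloads 12 bytes, the size of the whole object, which is the behavior that feature (a) above is meant to avoid.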
previous work have the significant drawback that data acquired under fault conditions are ... changes that are not noticeable to the human observer. It is also ... P. Smyth and J. Mellstrom, “Fault Diagnosis of Antenna Pointing Systems Using ...
In this paper, a fault estimation problem for a class of nonlinear systems subject to multiplicative faults and unknown disturbances is investigated. Multiplicative faults, which are usually mixed with system states and inputs, can cause additional complexity in the design of a fault estimator due to parameter changes within the process. Especially for a nonlinear system corrupted with unknown disturbances, it is not easy to distinguish the real fault factor from the mixed term. Under the nonlinear Lipschitz condition, the proposed robust adaptive fault estimation approach not only estimates the multiplicative faults and system states simultaneously, but also extracts the real effect of the faults. Meanwhile, the effect of disturbances is restricted to an L2-gain performance criterion which can be...
This paper analyzes the root causes of safety-related software faults. Faults identified as potentially hazardous to the system are distributed somewhat differently over the set of possible error causes than non-safety-related software faults.
In this paper, a fiber optic based sensor capable of fault detection in both radial and network overhead transmission power line systems is investigated. Bragg wavelength shift is used to measure the fault current and detect fault in power systems. Magnetic fields generated by currents in the overhe...
The paper considers the consensus problem in a partially synchronous system with Byzantine faults. It turns out that, in the partially synchronous system, all deterministic algorithms that solve consensus with Byzantine faults are leader-based. This is not the case for benign faults, which raises the...
The timing of faulting episodes can be constrained by radiometric dating of fault-zone rocks. Fault-zone material suitable for dating is produced by tectonic processes, such as (1) fragmentation of host rocks, followed by grain-size reduction and recrystallization to form mica and clay minerals, (2) secondary heating/melting of host rocks by frictional fault motions, and (3) mineral vein formation as a result of fluid advection associated with the fault motions. The thermal regime of fault zones consists primarily of the following three factors: (a) regional geothermal structure across the fault zone and background thermal history of the studied province bounded by fault systems, (b) frictional heating of wall rocks by fault motions, and (c) heating of host rocks by hot fluid advection in and around the fault zone. Thermochronological methods widely applied in fault zones are K-Ar (40Ar/39Ar), fission-track, and U-Th methods, for which methodological principles as well as analytical procedures are briefly described. The thermal sensitivities of individual thermochronological systems are then reviewed, which critically control the response of each method to the thermal processes. Based on the knowledge above, representative examples as well as key issues are highlighted to date fault gouges, pseudotachylytes, mylonites and carbonate veins, placing new constraints upon geological, geomorphological and seismological frames. Finally, the Nojima Fault is presented as an example for multiple applications of thermochronological methods in a complex fault zone.
The San Gabriel fault (SGF) in southern California is a right-lateral, strike-slip fault extending for 85 mi in an arcuate, southwestward-bowing curve from near the San Andreas fault at Frazier Mountain to its intersection with the left-lateral San Antonio Canyon fault (SACF) in the eastern San Gabriel Mountains. Termination of the SGF at the presently active SACF is abrupt and prompts the question: has the San Gabriel fault been offset? Tectonic and geometric relationships in the area suggest that the SGF has been offset approximately 6 mi in a left-lateral sense and that the offset continuation of the SGF, across the SACF, is the right-lateral, strike-slip San Jacinto fault (SJF), which also terminates at the SACF. Reversing the left-lateral movement on the SACF to rejoin the offset ends of the SGF and SJF reveals a fault trace that is remarkably similar in geometry and movement (and perhaps in tectonic history) to the trace of the San Andreas fault through the southern part of the San Bernardino Mountains. The relationship of the Sierra Madre-Cucamonga fault system to the restored SGF-SJF fault is strikingly similar to the relationship of the Banning fault to the Mission Creek-Mill Creek portion of the San Andreas fault. Structural relations suggest that the San Gabriel-San Jacinto system predates the San Andreas fault in the eastern San Gabriel Mountains and that continuing movement on the SACF is currently affecting the trace of the San Andreas fault in the Cajon Pass area.
Northern parts of the Ganga-Yamuna Interfluve in the Gangetic Plains, India have been investigated by remote sensing and Ground Penetrating Radar (GPR) techniques. Digital analysis of remote sensing data and Geographical Information System (GIS) techniques were used to locate a new active transverse Muzaffarnagar Fault and confirmed an earlier described Solani-II Fault in almost flat or gently sloping terrain. The Solani-II and Muzaffarnagar faults are members of two major systems of surficial faults, i.e., longitudinal and transverse faults, respectively. Longitudinal faults are curvilinear in nature, trending N-S in the northern regions and veering to E-W in the southern regions of the plains, and transverse faults are normal to the longitudinal faults occurring in the Upper Gangetic plains. GPR survey was carried out by the common offset method across the Muzaffarnagar and the Solani-II faults, using a 100 MHz antenna. Our GPR data indicate that both regions around the Solani-II Fault and Muzaffarnagar Fault are characterized by 2-3 major steeply dipping normal faults at shallow depth (II and Muzaffarnagar faults probably developed at about 2.5 ka and almost at the same time fans were deposited on the downthrown block of the Muzaffarnagar Fault.
The MFTF experiment's Sustaining Neutral Beam Power Supply System (SNBPSS) includes twenty-four 95 kV, 80 A accel dc power supplies (ADCPS). Each power supply includes a relatively high-impedance (20 percent) rectifier transformer and a step voltage regulator with a 50-100 percent voltage range. With this combination, the fault current for some postulated faults may be lower than the supply's full load current at maximum voltage. A design has been developed which uses protective relays and current-limiting fuses coordinated to detect phase and ground faults, DC faults, incorrect voltage conditions, rectifier faults, power factor correction capacitor faults, and overloads. This unusual solution ensures fast tripping on potentially destructive high-current faults and long time delays at lower currents to allow 30-second pulse operation. The ADCPS meets the LLL specification that all major assemblies be self-protecting, that is, able to sustain external faults without damage and to minimize damage due to internal faults.
Large-scale geometrical complexities along faults are known to be likely endpoints for coseismic rupture, as suggested by analysis of historic ruptures and corroborated by models of rupture on bent or discontinuous faults. However, natural faults also include smaller-scale complexities. We use the 3D finite element method to model dynamic ruptures on strike-slip fault stepovers with a smaller intermediate fault between the main strands. We find that such small faults can have a controlling effect on rupture behavior and ground motion intensity and distribution. In particular, the intermediate fault can either aid or prevent rupture propagation across the stepover, depending on its length and basal depth. The results have important implications for hazard assessment of faults with large- and small-scale geometrical complexities, and also suggest that more site-specific modeling studies may be necessary to develop realistic rupture scenarios for individual complex fault systems.
Oblique rifting in the central and northern Main Ethiopian Rift (MER) has resulted in a complex structural pattern characterized by two differently oriented fault systems: a set of NE-SW-trending boundary faults and a system of roughly NNE-SSW-oriented fault swarms affecting the rift floor (Wonji faults). Boundary faults formed oblique to the regional extension vector, likely as a result of the oblique reactivation of a pre-existing deep-seated rheological anisotropy, whereas internal Wonji faults developed sub-orthogonal to the stretching direction. Previous works have successfully reconciled this rift architecture and fault distribution with the long-term plate kinematics; however, at a more local scale, fault-slip and earthquake data reveal significant variations in the orientation of the minimum principal stress and related fault-slip direction across the rift valley. Whereas fault measurements indicate a roughly N95°E extension on the axial Wonji faults, a N105°E to N110°E directed minimum principal stress is observed along boundary faults. Both fault-slip data and analysis of seismicity indicate a roughly pure dip-slip motion on the boundary faults, although their orientation (oblique to the regional extension vector) should result in an oblique displacement. To shed light on the process driving the variability of data derived from fault-slip (and seismicity) analysis, we present crustal-scale analogue models of oblique rifting, deformed in a large-capacity centrifuge by using materials and boundary conditions described in several previous modeling works. As in these previous works, the experiments show the diachronous activation of two fault systems, boundary and internal, whose pattern strikingly resembles that observed in previous lithospheric-scale modeling, as well as that described in the MER. Internal faults arrange in two different, en-echelon segments connected by a transfer zone where strike-slip displacement dominates.
Whereas internal faults develop roughly orthogonal to the extension direction, boundary faults form oblique to the imposed stretching vector: as a group, the faults follow the rift trend, controlled by a pre-existing weak anisotropy, but individually they form oblique to both the rift margin and the extension vector. Detailed analysis of fault displacements suggests that whereas the average displacement on single internal faults is consistent with the imposed direction of extension, slip on boundary faults does not parallel this direction; the average motion on these faults is orthogonal to the faults, resulting in a roughly pure dip-slip motion. This gives rise to a marked difference in fault-slip direction between internal faults (where slip orientation follows the regional extension) and boundary faults (where displacement is oblique to the "regional" extension). A similar scenario is observed for the reconstructed direction of the minimum principal stress, which follows the regional stress field within the rift and is re-oriented at rift margins. Minor counterclockwise block rotations accommodate the different slip along the different fault systems. The model-to-nature similarity is striking in terms of fault orientation, stress and slip orientation and its across-axis variations. The analogue models thus allow us to explain the across-axis variability observed in natural fault-slip and earthquake data. Modeling results support the view that boundary faults form in response to a local stress re-orientation imposed by a deep-seated anisotropy: their displacement trajectories deviate from those imposed by the regional extension, resulting in a pure dip-slip motion in an overall oblique rifting kinematics, as observed in other sectors of the East African Rift. Conversely, internal faults, which form later and affect a weaker, more uniform lithosphere, respond directly to the regional extension direction, resulting in a fault slip sub-parallel to the Nubia-Somalia motion.
Minor counterclockwise block rotations are required to accommodate the difference in slip along the different fault systems.
Various papers on control are presented. The general topics addressed include: neural networks; robust stability; process control; large-scale and parallel computations in control; optimal control; mechanical and electrical control applications; identification of feedback systems; linear multivariable systems; advanced weapon control technology; numerical methods and computation; sampled data systems; l1, l2, H2, and H(infinity) control; force control; alpha-beta target tracking systems; variable structure control; robust control; stochastic systems; discrete time systems; H(infinity) control; decentralized systems; robotics; identification of multivariable systems; control of advanced aircraft; applications in nonlinear control; mixed H2/H(infinity) control; motion planning and trajectory control; aircraft flight guidance and control; model predictive control; parallel computing for power systems analysis; energy systems. Also addressed are: tracking and estimation; learning control; intelligent process control; analysis of systems with delay; spacecraft attitude dynamics and control; benchmark problems for robust control design; large power systems; fault-tolerant systems; modelling and control of distributed systems; estimation and hypothesis testing; chaotic systems; fuzzy control; intelligent control applications; identification and output error estimation; control of flexible structures; nonlinear process control; learning control and neural networks; repetitive control; discrete-event systems; interplay between identification and control design; covariance and Lyapunov-based control; robust stabilization and performance; nonlinear systems; identification and interpolation; estimation and filtering.
The author presents a technique for converting digraph models, including those models containing cycles, to a fault-tree format. A computer program which automatically performs this translation using an object-oriented representation of the models has been developed. The fault-trees resulting from translations can be used for fault-tree analysis and diagnosis. Programs to calculate fault-tree and digraph cut sets and perform diagnosis with fault-tree models have also been developed. The digraph to fault-tree translation system has been successfully tested on several digraphs of varying size and complexity. Details of some representative translation problems are presented. Most of the computation performed by the program is dedicated to finding minimal cut sets for digraph nodes in order to break cycles in the digraph. Fault-trees produced by the translator have been successfully used with NASA's Fault-Tree Diagnosis System (FTDS) to produce automated diagnostic systems.
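The notion of a minimal cut set used in this abstract can be sketched on a small acyclic fault tree. The code below is not the digraph translator or FTDS; it is a generic illustration that expands AND/OR gates into sets of basic events and discards any set that strictly contains another. The tuple-based tree encoding and the event names are assumptions made for the example.

```python
from itertools import product


def minimal_sets(sets):
    """Drop duplicates and every set that strictly contains another set."""
    unique = set(sets)
    return [s for s in unique if not any(other < s for other in unique)]


def cut_sets(node):
    """Minimal cut sets of a fault tree.

    A node is either a basic-event name (str) or a tuple (gate, children)
    with gate equal to "AND" or "OR".
    """
    if isinstance(node, str):
        return [frozenset([node])]
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any cut set of any child causes the gate output.
        combined = [s for sets in child_sets for s in sets]
    elif gate == "AND":
        # One cut set from every child must occur together.
        combined = [frozenset().union(*combo) for combo in product(*child_sets)]
    else:
        raise ValueError(f"unknown gate: {gate}")
    return minimal_sets(combined)


# Hypothetical tree: TOP = A OR (B AND (A OR C)).
tree = ("OR", ["A", ("AND", ["B", ("OR", ["A", "C"])])])
result = set(cut_sets(tree))
```

The non-minimal set {A, B} is removed because {A} alone already causes the top event. Cyclic digraphs, which the paper handles, would need the cycle-breaking step before this kind of expansion applies.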
This paper reports internal structures of a bedding-parallel fault in Permian limestone at Xiaojiaqiao outcrop that moved by about 0.5 m during the 2008 Mw 7.9 Wenchuan earthquake. The fault is located about 3 km to the south of the middle part of the Yingxiu-Beichuan fault, a major fault in the Longmenshan fault system that moved during the earthquake. The outcrop is also located at the Anxian transfer zone between the northern and central segments of the Yingxiu-Beichuan fault, where the fault system is complex. Thus the fault is an example of subsidiary faults activated by the Wenchuan earthquake. The fault has a strike of 243° or N63°E and a dip of 38°NW and is nearly optimally oriented for thrust motion, in contrast to high-angle coseismic faults at most places. Surface outcrop and two shallow drilling studies reveal that the fault zone is several centimeters wide at most and that the coseismic slip zone during the Wenchuan earthquake is about 1 mm thick. The fault zone contains foliated cataclasite, fault breccia, black gouge and yellowish gouge. Many clasts of foliated cataclasite and black gouge contained in fault breccia indicate multiple slip events along this fault. But fossils on both sides of the fault do not indicate a clear age difference, and overall displacement along this fault should not be large. We also report results from high-velocity friction experiments conducted on yellowish gouge from the fault zone using a rotary shear low to high-velocity frictional testing apparatus. Dry experiments at normal stresses of 0.4 to 1.8 MPa and at slip rates of 0.08 to 1.35 m/s reveal dramatic slip weakening from the peak friction coefficient of around 0.6 to a very low steady-state friction coefficient of 0.1-0.2. Slip weakening parameters of this carbonate fault zone are similar to those of clayey fault gouge from the Yingxiu-Beichuan fault at Hongkou outcrop and from the Pingxi fault zone.
Our experimental result will provide a condition for triggering movement of subsidiary faults or off-fault damage during a large earthquake.
Triple Modular Redundancy (TMR) is a suitable fault-tolerant technique for SRAM-based FPGAs. However, one of the main challenges in achieving 100% robustness in designs protected by TMR running on programmable platforms is to prevent upsets in the routing from provoking undesirable connections between signals from distinct redundant logic parts, which can generate an error in the output. This paper investigates the optimal design of the TMR logic (e.g., by cleverly inserting voters) to ensure robustness. Four different versions of a TMR digital filter were analyzed by fault injection. Faults were randomly inserted straight into the bitstream of the FPGA. The experimental results presented in this paper demonstrate that the number and placement of voters in the TMR design can directly affect the fault tolerance, with the number of routing upsets able to cause an error in the TMR circuit ranging from 4.03% to 0.98%.
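The voter behavior described above can be sketched in software. This is a behavioral model only, not the FPGA filter or the bitstream fault injection used in the paper: a bitwise majority voter masks an upset confined to one replica, but fails when a single fault corrupts two replicas at once, which is the situation a routing upset that connects two redundant domains can create. Names and values are hypothetical.

```python
def vote(r0, r1, r2):
    """Bitwise 2-of-3 majority vote over three replica outputs."""
    return (r0 & r1) | (r1 & r2) | (r0 & r2)


def flip_bit(word, bit):
    """Inject a single-bit upset into a word."""
    return word ^ (1 << bit)


golden = 0b10110010

# Upset confined to one redundant domain: the voter masks it.
masked = vote(flip_bit(golden, 3), golden, golden)

# The same upset reaching two domains: the voter outputs the wrong value.
unmasked = vote(flip_bit(golden, 3), flip_bit(golden, 3), golden)
```

Placing more voters between stages limits how far a corrupted signal can travel before it is outvoted, which is consistent with the abstract's finding that voter number and placement change the measured error rate.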
The objects of an HLA-based simulation can access model services to update their attributes. However, the grid server may become overloaded and refuse to let the model service handle object accesses. Because these objects accessed this model service during the last simulation loop and their intermediate state is stored on this server, this may terminate the simulation. A fault-tolerance mechanism must therefore be introduced into simulations. But the traditional fault-tolerance methods cannot meet these needs because the transmission latency between a federate and the RTI in a grid environment varies from several hundred milliseconds to several seconds. By adding model service URLs to the OMT and extending the HLA services and model services with some interfaces, this paper proposes a self-adaptive fault-tolerance mechanism for simulations according to the characteristics of federates accessing model services. Benchmark experiments indicate that the extended HLA/RTI lets simulations run self-adaptively in the grid environment.
First, we study the Unconstrained Fault-Tolerant Resource Allocation (UFTRA) problem (a.k.a. FTFA problem in \cite{shihongftfa}). In the problem, we are given a set of sites equipped with an unconstrained number of facilities as resources, and a set of clients with set $\mathcal{R}$ as corresponding connection requirements, where every facility belonging to the same site has an identical opening (operating) cost and every client-facility pair has a connection cost. The objective is to allocate facilities from sites to satisfy $\mathcal{R}$ at a minimum total cost. Next, we introduce the Constrained Fault-Tolerant Resource Allocation (CFTRA) problem. It differs from UFTRA in that the number of resources available at each site $i$ is limited by $R_{i}$. Both problems are practical extensions of the classical Fault-Tolerant Facility Location (FTFL) problem \cite{Jain00FTFL}. For instance, their solutions provide optimal resource allocation (w.r.t. enterprises) and leasing (w.r.t. clients) strategies for the cont...
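One common way to write the allocation problem described above is as an integer program. The following is a sketch under assumed notation, not necessarily the formulation used in the paper: $f_i$ is the opening cost at site $i$, $c_{ij}$ the connection cost between site $i$ and client $j$, $r_j \in \mathcal{R}$ the connection requirement of client $j$, $y_i$ the number of facilities opened at site $i$, and $x_{ij}$ the number of connections between client $j$ and site $i$. Only $\mathcal{R}$ and $R_i$ come from the abstract.

```latex
\begin{align*}
\text{minimize}   \quad & \sum_{i} f_i\, y_i + \sum_{i}\sum_{j} c_{ij}\, x_{ij} \\
\text{subject to} \quad & \sum_{i} x_{ij} \ge r_j   && \text{for every client } j, \\
                        & x_{ij} \le y_i            && \text{for every site } i \text{ and client } j, \\
                        & x_{ij},\, y_i \in \mathbb{Z}_{\ge 0} \\
\text{CFTRA only:} \quad & y_i \le R_i              && \text{for every site } i.
\end{align*}
```

Under this notation, UFTRA is the program without the last constraint, and CFTRA adds the per-site capacity bound $R_i$.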
The number of processors embedded on high performance computing platforms is growing daily to satisfy the user desire for solving larger and more complex problems. Scalable and fault-tolerant runtime environments are needed to support and adapt to the underlying libraries and hardware which require a high degree of scalability in dynamic large-scale environments. This paper presents a self-healing network (SHN) for supporting scalable and fault-tolerant runtime environments. The SHN is designed to support transmission of messages across multiple nodes while also protecting against recursive node and process failures. It will automatically recover itself after a failure occurs. SHN is implemented on top of a scalable fault-tolerant protocol (SFTP). The experimental results show that both th...
A novel fault-tolerant five-input majority gate for quantum-dot cellular automata is presented. Quantum-dot cellular automata (QCA) is an emerging technology which is considered to be presented in future computers. Two principal logic elements in QCA are the "majority gate" and the "inverter." In this paper, we propose a new approach to the design of fault-tolerant five-input majority gate by considering two-dimensional arrays of QCA cells. We analyze fault-tolerance properties of such block five-input majority gate in terms of misalignment, missing, and dislocation cells. Some physical proofs are used for verifying five-input majority gate circuit layout and functionality. Our results clearly demonstrate that the redundant version of the block five-input majority gate is more robust than ...
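The logical function of a five-input majority gate can be sketched in a few lines of Python. This models only the Boolean behaviour, not the QCA cell layout or the fault model from the paper; the name `majority5` is an assumption made for illustration.

```python
from itertools import product


def majority5(a: int, b: int, c: int, d: int, e: int) -> int:
    """Return 1 when at least three of the five binary inputs are 1."""
    return int(a + b + c + d + e >= 3)


# Tying two inputs to constants turns the gate into a three-input AND or OR.
for a, b, c in product((0, 1), repeat=3):
    assert majority5(a, b, c, 0, 0) == (a & b & c)
    assert majority5(a, b, c, 1, 1) == (a | b | c)

print(majority5(1, 0, 1, 1, 0))
```

The loop checks the well-known property that fixing two inputs to 0 or to 1 yields a three-input AND or OR, which is one reason the five-input gate is a useful building block.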
The NASA Lewis Research Center is developing an information-switching processor for future meshed very-small-aperture terminal (VSAT) communications satellites. The information-switching processor will switch and route baseband user data onboard the VSAT satellite to connect thousands of Earth terminals. Fault tolerance is a critical issue in developing information-switching processor circuitry that will provide and maintain reliable communications services. In parallel with the conceptual development of the meshed VSAT satellite network architecture, NASA designed and built a simple test bed for developing and demonstrating baseband switch architectures and fault-tolerance techniques. The meshed VSAT architecture and the switching demonstration test bed are described, and the initial switching architecture and the fault-tolerance techniques that were developed and tested are discussed.
In order to guarantee the safety of nuclear power plants (NPP), we built two real-time fault diagnosis systems in the Visual Basic 6.0 programming language, which apply neural network technology and data fusion technology respectively. The fault diagnosis systems exchange data with the simulator in a timely manner through a communication interface. We insert faults on the simulator to test the two systems on-line. The advantages and disadvantages are illustrated and contrasted by analyzing the fault diagnosis results off-line, which lays the foundation for further research on and application of fault diagnosis systems in nuclear power plants. (authors)
The present symposium on gyro technology discusses design considerations for highly reliable hardware and software fault-tolerant inertial reference systems, the constant value thresholds of FDI in strapdown inertial navigation, the LISA 6000, and in-flight alignment and calibration of a tactical missile INS. Attention is given to the development of a solid-state gyroscope, the development of an inexpensive gyroscope based on the principle of a vibrating piezoelectric plate, the theory and design considerations of a new two-axis angular rate gyro, and the manufacturing and testing of the Smiths 3000 microminiature DTG. Topics addressed include a down sampling technique for fiber gyroscope signal processing, second-order backscatter-induced crosstalk in a multiplexed 2D gyro system, a low-cost tactical laser gyroscope with no moving parts, and input axis measurements for a rate-biased laser gyro.
The reactor "delta T", the difference between the average core inlet and outlet temperatures, for the liquid-sodium-cooled Experimental Breeder Reactor 2 is empirically synthesized in real time from a multitude of examples of past reactor operation. The real-time empirical synthesis is based on system state analysis (SSA) technology embodied in software on the EBR 2 data acquisition computer. Before the real-time system is put into operation, a selection of reactor plant measurements is made which is predictable over long periods encompassing plant shutdowns, core reconfigurations, core load changes, and plant startups. A serial data link to a personal computer containing SSA software allows the rapid verification of the predictability of these plant measurements via graphical means. After the selection is made, the real-time synthesis provides a fault-tolerant estimate of the reactor delta T accurate to ±1%. 5 refs., 7 figs.
The use of performance specifications became a key US Department of Defense (DoD) acquisition reform in the 1990s. Gone were the detailed military specifications for parts and materials selection, workmanship, derating, and fault tolerance. Maximum use was to be made of commercial specs. Reliability would be specified only in quantitative terms such as Mean-Time-To-Failure (MTTF) and/or the reliability R(t). However, there are significant differences between commercial and military weapon systems whereby performance specifications might bring in significantly greater mission risk. Reliability data from 1996 to 2000 might be an indicator of negative unintended consequences of the cancellation of military specifications. The acquisition of successful military systems requires a mix ...
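To make the two quantitative terms concrete: if one assumes a constant failure rate (an exponential lifetime model, which the abstract does not state), the reliability function follows from the MTTF as R(t) = exp(-t / MTTF). The sketch below uses that assumption; the function name and the 1000-hour figure are illustrative only.

```python
import math


def reliability(t_hours: float, mttf_hours: float) -> float:
    """R(t) under an assumed constant failure rate of 1 / MTTF."""
    return math.exp(-t_hours / mttf_hours)


mttf = 1000.0  # hypothetical MTTF in hours
for t in (100.0, 500.0, 1000.0):
    print(f"R({t:.0f} h) = {reliability(t, mttf):.3f}")
```

Under this model the probability of surviving to t = MTTF is exp(-1), roughly 37%, which is why an MTTF figure alone says little about mission risk without a stated mission duration.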
Tests have been carried out on cows to determine whether steady-state tingle voltages affect milk production or growth rate. Alternating current tolerance levels for cows and pigs were discussed. Metal objects in barns carry some voltage because of nearby electrical activities. As a result, livestock could carry some current through their body and foot-to-ground impedance. Measurement techniques were developed and applied to three barns in Ontario. The highest electrical transients occurred during lightning storms. Electric fence generators were found to emit continuous low-magnitude surges. The presence of load switching and power system fault transients was recorded. Tingle filters, inserted between the power system neutral and the ground, were found to be effective for isolating transient stray voltages. 37 refs., 4 tabs., 98 figs.
Determinism - the property that executing a program on given inputs always yields the same result - not only makes software development easier, but is indispensable to many advanced debugging, fault-tolerance, security, and accountability techniques. Today's systems do not naturally support determinism, however, especially in the cloud: many-core parallelism makes execution timing-sensitive, and even correct code can yield different results on different processors in a heterogeneous cluster. To provide deterministic parallelism, we propose an "isolate-and-reconcile" execution model, in which threads work on private copies of shared memory or file system state except at timing-insensitive synchronization points, where these working copies are merged. To address heterogeneity, we explore techniques for canonicalizing or hiding architecturally unpredictable results, and examine the cost of eliminating floating-point primitives that are fundamentally difficult to canonicalize. Preliminary measurements suggest tha...
While high-level data parallel frameworks, like MapReduce, simplify the design and implementation of large-scale data processing systems, they do not naturally or efficiently support many important data mining and machine learning algorithms and can lead to inefficient learning systems. To help fill this critical void, we introduced the GraphLab abstraction which naturally expresses asynchronous, dynamic, graph-parallel computation while ensuring data consistency and achieving a high degree of parallel performance in the shared-memory setting. In this paper, we extend the GraphLab framework to the substantially more challenging distributed setting while preserving strong data consistency guarantees. We develop graph based extensions to pipelined locking and data versioning to reduce network congestion and mitigate the effect of network latency. We also introduce fault tolerance to the GraphLab abstraction using the classic Chandy-Lamport snapshot algorithm and demonstrate how it can be easily implemented by e...
For over a decade, dCache has been synonymous with large-capacity, fault-tolerant storage using commodity hardware that supports seamless data migration to and from tape. Over that time, it has satisfied the requirements of various demanding scientific user communities to store their data, transfer it between sites and access it quickly at the local site. When the dCache project started, the focus was on managing a relatively small disk cache in front of large tape archives. Over the project's lifetime storage technology has changed. During this period, technology changes have driven down the cost-per-GiB of hard disks. This resulted in a shift towards systems where the majority of data is stored on disk. More recently, the availability of Solid State Disks, while not yet a replacement for magnetic disks, offers an intriguing opportunity for significant performance improvement if they can be used intelligently within an existing system. New technologies provide new opportunities and dCache user communities' computi...
Access control is an issue of paramount importance in cyber-physical systems (CPS). In this paper, an access control scheme, namely FEAC, is presented for CPS. FEAC can not only provide the ability to control access to data in normal situations, but also adaptively assign emergency roles and permissions to specific subjects and inform subjects without explicit access requests to handle emergency situations in a proactive manner. In FEAC, emergency-group and emergency-dependency are introduced. Emergencies are processed in sequence within the group and in parallel among groups. A priority and dependency model called PD-AGM is used to select the optimal response-action execution path, aiming to eliminate all emergencies that occurred within the system. Fault-tolerant access control policies are used to address failure in emergency management. A case study of the hospital medical care application shows the effectiveness of FEAC.
Virtualization has become commonplace in modern data centers, often referred to as "computing clouds". The capability of virtual machine live migration brings benefits such as improved performance, manageability and fault tolerance, while allowing workload movement with a short service downtime. However, service levels of applications are likely to be negatively affected during a live migration. For this reason, a better understanding of its effects on system performance is desirable. In this paper, we evaluate the effects of live migration of virtual machines on the performance of applications running inside Xen VMs. Results show that, in most cases, migration overhead is acceptable but cannot be disregarded, especially in systems where availability and responsiveness are governed by strict Service Level Agreements. Despite that, there is a high potential for live migration applicability in data centers serving modern Internet applications. Our results are based on a workload covering the domain of multi-tier We...
High power space systems which use low dc voltage, high current sources such as thermoelectric generators will most likely require high voltage conversion for transmission purposes. This study considers the Schwarz resonant converter as the basic building block to accomplish this low-to-high voltage conversion for either a dc or an ac spacecraft bus. The Schwarz converter has the important assets of both inherent fault tolerance and resonant operation, and parallel operation in modular form is possible. A regulated dc spacecraft bus requires only a single stage converter while a constant frequency ac bus requires a cascaded Schwarz converter configuration. If the power system requires constant output power from the dc generator, then a second converter is required to route unneeded power to a ballast load.
Are systems that display Topological Quantum Order (TQO), and have a gap to excitations, hardware fault-tolerant at finite temperatures? We show that in surface code models that display low d-dimensional Gauge-Like Symmetries, such as Kitaev's and its generalizations, the expectation value of topological symmetry operators vanishes at any non-zero temperature, a phenomenon that we coined thermal fragility. The autocorrelation time for the non-local topological quantities in these systems may remain finite even in the thermodynamic limit. We provide explicit expressions for the autocorrelation functions in Kitaev's model. If temperatures far below the gap may be achieved then these autocorrelation times, albeit finite, can be made large. The physical engine behind the loss of correlations at large spatial and/or temporal distance is the proliferation of topological defects at any finite temperature as a result of a dimensional reduction. This raises an important question: How may we best quantify the degree of...
In multiprogrammed systems, synchronization often turns out to be a performance bottleneck and the source of poor fault-tolerance. Wait-free and lock-free algorithms can do without locking mechanisms, and therefore do not suffer from these problems. We present an efficient almost wait-free algorithm for parallel accessible hashtables, which promises more robust performance and reliability than conventional lock-based implementations. Our solution is as efficient as sequential hashtables. It can easily be implemented using C-like languages and requires on average only constant time for insertion, deletion or accessing of elements. Apart from that, our new algorithm allows the hashtables to grow and shrink dynamically when needed. A true problem of lock-free algorithms is that they are hard to design correctly, even when apparently straightforward. Ensuring the correctness of the design at the earliest possible stage is a major challenge in any responsible system development. Our algorithm contains 81 atomic st...
Chelonia is a novel grid storage system designed to fill the requirements gap between those of large, sophisticated scientific collaborations which have adopted the grid paradigm for their distributed storage needs, and of corporate business communities gravitating towards the cloud paradigm. Chelonia is an integrated system of heterogeneous, geographically dispersed storage sites which is easily and dynamically expandable and optimized for high availability and scalability. The architecture and implementation in terms of web-services running inside the Advanced Resource Connector Hosting Environment Daemon (ARC HED) are described, and results of tests in both local-area and wide-area networks that demonstrate the fault tolerance, stability and scalability of Chelonia will be presented. In addition, example setups for production deployments for small and medium-sized VOs are described.
Proposed exascale systems will present a number of considerable resiliency challenges. In particular, DRAM soft-errors, or bit-flips, are expected to greatly increase due to the increased memory density of these systems. Current hardware-based fault-tolerance methods will be unsuitable for addressing the expected soft error frequency rate. As a result, additional software will be needed to address this challenge. In this paper we introduce LIBSDC, a tunable, transparent silent data corruption detection and correction library for HPC applications. LIBSDC provides comprehensive SDC protection for program memory by implementing on-demand page integrity verification. Experimental benchmarks with Mantevo HPCCG show that once tuned, LIBSDC is able to achieve SDC protection with 50% overhead of resources, less than the 100% needed for double modular redundancy.
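The idea of page integrity verification can be illustrated with a small checksum sketch. This is not LIBSDC's mechanism or API, which the abstract does not detail; it only shows per-page checksums being recorded and later re-checked. The 4096-byte page size, the use of CRC-32, and all names are assumptions, and the sketch detects corruption without correcting it.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size in bytes


def page_checksums(memory: bytes) -> list[int]:
    """Compute one CRC-32 per fixed-size page of the buffer."""
    return [zlib.crc32(memory[i:i + PAGE_SIZE])
            for i in range(0, len(memory), PAGE_SIZE)]


def corrupted_pages(memory: bytes, saved: list[int]) -> list[int]:
    """Return the indices of pages whose checksum no longer matches."""
    return [n for n, crc in enumerate(page_checksums(memory)) if crc != saved[n]]


memory = bytearray(3 * PAGE_SIZE)
saved = page_checksums(memory)     # record checksums while data is known good
memory[PAGE_SIZE + 10] ^= 0x01     # simulate a single bit flip in page 1
print(corrupted_pages(memory, saved))
```

A real library would verify a page only when it is about to be used, rather than rescanning everything, which is presumably what "on-demand" refers to in the abstract.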
In complex software systems, modularity and readability tend to be degraded owing to inseparable interactions between concerns that are distinct features in a program. Such interactions result in tangled code that is hard to develop and maintain. Aspect-Oriented Programming (AOP) is a powerful method for modularizing source code and for decoupling cross-cutting concerns. A decade of growing research on AOP has brought the paradigm into many exciting areas. However, pioneering work on AOP has not flourished enough to enrich the design of distributed systems using the refined AOP paradigm. This article investigates three case studies that cover time-honored issues such as fault-tolerant computing, network heterogeneity, and object replication in the cluster computing community using the AOP ...
In this paper we outline a software development process for safety-critical systems that aims at combining some of the specific strengths of model-based development with those of programming language based development using safety-critical subsets of Ada. Model-based software development and model-based test case generation techniques are combined with code generation techniques and tools providing a transition from model to code both for a system itself and for its test cases. This allows developers to combine domain-oriented, model-based techniques with source code based validation techniques, as required for conformity with standards for the development of safety-critical software, such as the avionics standard RTCA/DO-178B. We introduce the AutoFocus and Validator modeling and validation toolset and sketch its usage for modeling, test case generation, and code generation in a combined approach, which is further illustrated by a simplified leading edge aerospace model with built-in fault tolerance.
Peer-to-Peer (P2P) networking is an alternative to cloud computing for relatively more informal trade. One of the major obstacles to its development is the free riding problem, which significantly degrades the scalability, fault tolerance and content availability of the systems. The bartering exchange ring based incentive mechanism is one of the most common solutions to this problem. It organizes users with asymmetric interests into bartering exchange rings, forcing users to contribute while consuming. However, the existing bartering exchange ring formation approaches are inefficient and static. This paper proposes a novel cluster based incentive mechanism (CBIM) that enables dynamic ring formation by modifying the Query Protocol of underlying P2P systems. It also u...
Microsoft Cluster Service (MSCS) extends the Windows NT operating system to support high-availability services. The goal is to offer an execution environment where off-the-shelf server applications can continue to operate, even in the presence of node failures. Later versions of MSCS will provide scalability via a node and application management system that allows applications to scale to hundreds of nodes. This paper provides a detailed description of the MSCS architecture and the design decisions that have driven the implementation of the service. The paper also describes how some major applications use the MSCS features, and describes features added to make it easier to implement and manage fault-tolerant applications on MSCS.
We propose families of protocols for magic state distillation, important components of fault-tolerance schemes, for systems of odd prime dimension. Our protocols utilize quantum Reed-Muller codes with transversal non-Clifford gates. We find that in higher dimensions smaller codes can be used than one might expect based on qubit codes. All our protocols produce magic states at a resource cost that increases only polynomially with the inverse of the final output error probability. We give specific details for 3-dimensional systems, where we find that certain magic states can be distilled provided an initial error probability of less than 20.02% or a depolarizing noise rate of less than 31.7%. This is the largest error probability threshold of all known protocols with polynomial resource cost. For a depolarizing noise model we also give distillation thresholds for odd prime dimensions up to 19.
We show that quantum systems of extended objects naturally give rise to a large class of exotic phases - namely topological phases. These phases occur when the extended objects, called "string-nets", become highly fluctuating and condense. We derive exactly soluble Hamiltonians for 2D local bosonic models whose ground states are string-net condensed states. Those ground states correspond to 2D parity invariant topological phases. These models reveal the mathematical framework underlying topological phases: tensor category theory. One of the Hamiltonians - a spin-1/2 system on the honeycomb lattice - is a simple theoretical realization of a fault-tolerant quantum computer. The higher dimensional case also yields an interesting result: we find that 3D string-net condensation naturally gives rise to both emergent gauge bosons and emergent fermions. Thus, string-net condensation provides a mechanism for unifying gauge bosons and fermions in 3 and higher dimensions.
The current intrusion detection systems have a number of problems that limit their configurability, scalability and efficiency. There have been some propositions about distributed architectures based on multiple independent agents working collectively for intrusion detection. However, these distributed intrusion detection systems are not fully distributed as most of them centrally analyze data collected from distributed nodes which may lead to a single point of failure. In this paper, a distributed intrusion detection architecture is presented that is based on autonomous and cooperating agents without any centralized analysis components. The agents cooperate by using a hierarchical communication of interests and data, and the analysis of intrusion data is made by the agents at the lowest level of the hierarchy. This architecture provides significant advantages in scalability, flexibility, extensibility, fault tolerance, and resistance to compromise. A proof-of-concept prototype is developed and experiments ha...
Microfluidics-based biochips are soon expected to revolutionize clinical diagnosis, DNA sequencing, and other laboratory procedures involving molecular biology. Most microfluidic biochips are based on the principle of continuous fluid flow and they rely on permanently-etched microchannels, micropumps, and microvalves. We focus here on the automated design of "digital" droplet-based microfluidic biochips. In contrast to continuous-flow systems, digital microfluidics offers dynamic reconfigurability; groups of cells in a microfluidics array can be reconfigured to change their functionality during the concurrent execution of a set of bioassays. We present a simulated annealing-based technique for module placement in such biochips. The placement procedure not only addresses chip area, but it also considers fault tolerance, which allows a microfluidic module to be relocated elsewhere in the system when a single cell is detected to be faulty. Simulation results are presented for a case study involving the polymeras...
Today, many distributed applications are typically deployed at a large scale, including Grid, web search engines and content distribution networks, and it is expected for their scale to grow more in terms of number of machines, locations and administrative domains. This poses many scalability issues related to the scale of the environment they run in. To explicitly address these issues, many distributed systems and everyday services use peer-to-peer (P2P) overlays to allow other parts of the system to benefit from the fault-tolerance and scalability of P2P technology. In particular, Distributed Hash Tables (DHTs), which implement a simple put-and-get interface to a dictionary-like data structure, have been extensively used to overcome the current limitations associated with the centralized...
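The put-and-get interface mentioned above can be sketched as a toy, single-process hash ring. This is an illustration of the general DHT idea (each key is owned by the first node at or after its hash on a ring), not any specific system from the abstract. The class name `ToyDHT`, the use of SHA-1, and the node names are assumptions, and there is no networking, replication, or churn handling.

```python
import bisect
import hashlib


def ring_hash(text: str) -> int:
    """Map a string to a position on the ring (SHA-1 is an arbitrary choice)."""
    return int(hashlib.sha1(text.encode()).hexdigest(), 16)


class ToyDHT:
    def __init__(self, node_names):
        # Nodes sorted by ring position; each node keeps a local dict.
        self.ring = sorted((ring_hash(n), n) for n in node_names)
        self.store = {n: {} for n in node_names}

    def owner(self, key: str) -> str:
        """First node at or after the key's position, wrapping around."""
        i = bisect.bisect_left(self.ring, (ring_hash(key), ""))
        return self.ring[i % len(self.ring)][1]

    def put(self, key: str, value) -> None:
        self.store[self.owner(key)][key] = value

    def get(self, key: str):
        return self.store[self.owner(key)].get(key)


dht = ToyDHT(["node-a", "node-b", "node-c"])
dht.put("report.pdf", "stored bytes")
print(dht.owner("report.pdf"), dht.get("report.pdf"))
```

Because ownership depends only on hash positions, adding or removing a node moves only the keys adjacent to it on the ring, which is the property that makes such overlays attractive for scalability.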
An agile monitoring and diagnostic system plays a vital role in manufacturing, since it can considerably increase its robustness and efficiency. Applying agent technology to such systems is recognised as an appropriate approach, providing fault-tolerance and means for failure recovery in the case of sudden anomalies. In this article, we present an automation agent approach with agents comprising a software component with an integrated world model repository besides the related hardware. The world model is an explicit representation of the external surroundings and internals of the agent, and is used to detect anomalies in its own behaviour. We use an ontology to formalise the agent's knowledge. Applicability and functionality of our approach are presented in an example employing a real sys...
Blind quantum computation is a novel secure quantum-computing protocol that enables Alice, who does not have sufficient quantum technology at her disposal, to delegate her quantum computation to Bob, who has a fully fledged quantum computer, in such a way that Bob cannot learn anything about Alice's input, output and algorithm. A recent proof-of-principle experiment demonstrating blind quantum computation in an optical system has raised new challenges regarding the scalability of blind quantum computation in realistic noisy conditions. Here we show that fault-tolerant blind quantum computation is possible in a topologically protected manner using the Raussendorf-Harrington-Goyal scheme. The error threshold of our scheme is 4.3×10^-3, which is comparable to that (7.5×10^-3) of non-blind topological quantum computation. As an error per gate of the order of 10^-3 has already been achieved in some experimental systems, our result implies that secure cloud quantum computation is within reach.
A natural way for cooperative tasking in multi-agent systems is through a top-down design by decomposing a global task into subtasks such that the accomplishments of these subtasks will guarantee the achievement of the global task. In our previous work [1], we proposed necessary and sufficient conditions on the decomposability of a task automaton between two cooperative agents, and showed that given a concurrent system and a decomposable task automaton, the global specification is satisfied by design, if local supervisors exist to satisfy local specifications. This paper represents a continuation of the work in [1], and deals with the issue when failures occur in sensors, actuators and possibly communication links. The main concern under failure is whether a previously decomposable task can still be achieved collectively by the agents. If not, we would like to investigate under what conditions the global task could be robustly accomplished. This is actually the fault-tolerance issue of the top-down desi...
This work describes a new technique, based on exchanging control signals between neighboring nodes, for constructing a stable and fault-tolerant global clock in a distributed system with an arbitrary topology. It is shown that it is possible to construct a global clock reference with a time step that is much smaller than the propagation delay over the network's links. The synchronization algorithm ensures that the global clock "tick" has a stable periodicity, and therefore, it is possible to tolerate failures of links and clocks that operate faster and/or slower than nominally specified, as well as hard failures. The approach taken in this work is to generate a global clock from the ensemble of the local transmission clocks and not to directly synchronize these high-speed clocks. The steady-state algorithm, which generates the global clock, is executed in hardware by the network interface of each node. At the network interface, it is possible to measure accurately the propagation delay between neighboring nodes with a small error or uncertainty and thereby to achieve global synchronization that is proportional to these error measurements. It is shown that the local clock drift (or rate uncertainty) has only a secondary effect on the maximum global clock rate. The synchronization algorithm can tolerate any physical failure. 18 refs.
Structures along the Hopena normal fault of the Koae fault system (KFS) on Kilauea volcano, Hawaii, provide a record of fault propagation in three dimensions. This fault displays (1) a breached monocline along the scarp; (2) a belt of discontinuous echelon fractures along the scarp and past its end; (3) a belt of discontinuous fractures on the footwall; (4) buckles at the base of the scarp; and (5) a belt of discontinuous fractures on the hangingwall that converges towards the end of the fault trace. Solid mechanics analyses show that this ensemble can be accounted for by the tipline of a normal fault propagating up towards the surface (accompanied by antithetic fracturing), then intersecting it and propagating laterally. Normal fault propagation down from the surface cannot account for the observed structures. Lateral slip occurs along some fractures as a result of local stress changes associated with fault propagation. Discontinuous fractures at the surface form early above a blind normal fault and continue to develop as a normal fault propagates laterally. Linkage of the fractures to the fault produces a fault with an irregular trace. The discontinuous, irregular character of normal fault traces over a broad range of scale is an inevitable consequence of three-dimensional fault growth.
Northern California's tectonics are characterized by a fault system in which the main trace of the San Andreas fault is associated with two en échelon fault systems, the Maacama and Roger Creek, which are the northward continuations of the Calaveras and Hayward faults in the Bay Area. Geological evidence, along with principal stress orientations, suggests that this fault system behaves mechanically differently from the southern segments, and shows that its frictional strength is slightly higher than in the south. In this study we investigate the influence of thermal and rheological parameters and the geometry of the fault system on stress orientations and long-term velocities. We model the fault system in a cross-section view, from Pt. Arena in the west to 180 km inland, assuming there is no stress or strain variation along strike in this area. A simplified three-dimensional (3-D) model was built with a finite element code (ADELI3D). Heat flow measurements, seismicity distribution, and surface velocities were used to constrain the thermal structure of the crust and internal fault friction. The rheology of the lithosphere is composed of a frictional upper crust and a viscoelastic lower crust. Fault zones are modeled using low effective friction with respect to the surrounding crust. The lithosphere is supported by hydrostatic pressure at its base (representing the asthenosphere). We present experiments on the long-term deformation of the fault system in northern California in which we adjust the thermal field, the fault system geometry, and the fault rheology that control the resulting velocity and stress fields. The hypotheses we have tested include a stratified or laterally varying temperature field, and dipping or vertical faults for the Maacama and Bartlett Spring fault system. Preliminary experiments show a rotation of the orientation of the maximum horizontal compression, SH, from 65-75° outside the fault zone to 45° inside.
A strong lateral heat flow variation predicts that SH is oriented at a higher angle to the fault, while modeling the fault system to the east as dipping planes creates a lateral asymmetry in the distribution of the orientation of SH, and increases its angle to the fault between the different fault systems at depth. Adjusting the internal friction in the different fault zones to fit the velocity field given by GPS measurements will make it possible to determine which of these tests best predicts the mechanical behavior of Northern California's fault system.
The quality of electricity supply is of increasing importance, even as conditions make it harder to maintain. Consumers no longer tolerate outages lasting even a few seconds. This forces electric utility companies to maintain network operation during single-phase earth faults. This paper presents the results of joint research investigating the potential of overhead line poles during faults with a newly developed pole model.
A new fault-tolerant state assignment method is suggested for synchronous sequential machines. It is assumed that the inputs are fault free and that for no input it is possible to reach all or most of the states, whose number may be fairly large. Error correcting codes for the state assignment are generated by permutations of a chosen linear code. A state assignment algorithm is developed and its computational complexity is estimated. Examples are given.
Assessment of the efficiency of a fault system requires a complete accounting of the energy budget of the system. Work done in the deformation of a faulted area consists of four components: (1) work done resisting friction during slip on the faults (Wf); (2) work done against gravity in uplift of topography (Wg) (this term can be negative where deformation decreases elevation); (3) internal energy of the strained host rock (Wi); and (4) work done in initializing new faults and propagating existing faults (Wp). The energy budget of the system can thus be expressed as Wtot = Wf + Wg + Wi + Wp. For a balanced energy budget the total internal work, Wtot, equals the external tectonic work on the system. We examine the work balance within successively more complex fault systems using boundary element method models. The work balance approach has two primary benefits: it permits analysis of an entire system of interacting faults and it aids our understanding of fault system evolution. For example, new fault surfaces develop at the 'cost' of internal strain and/or tectonic work. Because this approach examines the entire system, the locally destructive or constructive interaction of individual faults is tempered by the role of these faults within the larger system.
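The four-term budget in the abstract above can be written as a small bookkeeping check. The following is a minimal Python sketch and not the authors' boundary element model: the class name `WorkBudget`, its field names, the tolerance, and the numbers in the usage lines are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class WorkBudget:
    """Work terms for a faulted region, all in the same energy unit (hypothetical container)."""
    w_friction: float     # Wf: work done resisting friction during slip on the faults
    w_gravity: float      # Wg: work against gravity (negative where elevation decreases)
    w_internal: float     # Wi: internal energy of the strained host rock
    w_propagation: float  # Wp: work done initiating and propagating faults

    def total(self) -> float:
        """Wtot = Wf + Wg + Wi + Wp."""
        return self.w_friction + self.w_gravity + self.w_internal + self.w_propagation

    def is_balanced(self, w_external: float, rel_tol: float = 1e-6) -> bool:
        """True when total internal work matches the external tectonic work."""
        return math.isclose(self.total(), w_external, rel_tol=rel_tol)


# Illustrative values only; they are not taken from the paper.
budget = WorkBudget(w_friction=4.0, w_gravity=-1.0, w_internal=2.0, w_propagation=0.5)
print(budget.total(), budget.is_balanced(5.5))
```

Keeping the gravity term signed mirrors the remark that Wg can be negative where deformation lowers topography.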
In this work, a model is proposed for assessing the availability of fault-tolerant systems, based on the integration of continuous-time semi-Markov processes and Bayesian belief networks. This integration results in a hybrid stochastic model that is able to represent the dynamic characteristics of a system as well as to deal with cause-effect relationships among external factors such as environmental and operational conditions. The hybrid model also allows for uncertainty propagation on the system availability. A numerical procedure is also proposed for solving the state probability equations of semi-Markov processes described in terms of transition rates. The numerical procedure is based on the application of Laplace transforms that are inverted by the Gauss quadrature method known as Gauss-Legendre. The hybrid model and numerical procedure are illustrated by means of an example application in the context of fault-tolerant systems.
In this paper, we consider sensor and actuator fault detection and estimation issues for large scale wind turbine systems where individual pitch control (IPC) is used for load reduction. The faults considered are the blade root bending moment sensor faults and blade pitch actuator faults. In the first part, with the aid of a dynamical model of the wind turbine system, a so-called H-/H∞ observer in the finite frequency range is applied to generate the residual for fault detection. The observer is designed to be sensitive to faults but insensitive to disturbances, such as wind turbulence. When there is a detectable fault, the observer sends an alarm signal if the residual evaluation is larger than a predefined threshold. In addition to the fault detection, we also consider the faul...
Past movement on faults can be dated by measurement of the intensity of ESR signals in quartz. These signals are reset by local lattice deformation and local frictional heating on grain contacts at the time of fault movement. The ESR signals then grow back as a result of bombardment by ionizing radiation from surrounding rocks. The age is obtained from the ratio of the equivalent dose, needed to produce the observed signal, to the dose rate. Fine grains are more completely reset during faulting, and a plot of age vs grain size shows a plateau for grains below a critical size: these grains are presumed to have been completely zeroed by the last fault activity. We carried out ESR dating of fault rocks collected from the Yangsan fault system. ESR dates from this fault system range from 870 to 240 ka. Results of this research suggest that long-term cyclic fault activity continued into the Pleistocene.
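The age relation stated in the abstract above (equivalent dose divided by dose rate) is a one-line calculation. The sketch below is illustrative only: the function name `esr_age_ka`, the choice of units (Gy and Gy per ka), and the sample numbers are assumptions, not values from the study.

```python
def esr_age_ka(equivalent_dose_gy: float, dose_rate_gy_per_ka: float) -> float:
    """Return an ESR age in ka as equivalent dose / dose rate.

    Units are an assumption of this sketch: dose in Gy, dose rate in Gy per ka.
    """
    if dose_rate_gy_per_ka <= 0:
        raise ValueError("dose rate must be positive")
    return equivalent_dose_gy / dose_rate_gy_per_ka


# Hypothetical sample: 1200 Gy accumulated at 2 Gy/ka.
print(esr_age_ka(1200.0, 2.0))
```

In practice the age would be computed per grain-size fraction, and only the fractions on the plateau described above would be taken as dating the last fault movement.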
A new fault location (FL) method using composite fiber-optic overhead ground wires (OPGWs) is developed to find out where electrical faults occur on overhead power transmission lines. This method locates the fault section by detecting the current induced in the ground wire (GW), i.e. OPGW in this system. Since detected fault information is essentially uncertain, the new FL method treats the fault information as a current distribution pattern throughout the power line, and applies Fuzzy Theory to realize the human-like manner of fault location used by electrical power engineers. It was confirmed by computer simulations that the fault section can be accurately located using this method under various conditions. This FL system has already been applied to several commercial power transmission lines and successfully located the sections where electrical faults occurred on actual power transmission lines.
This paper presents a new and accurate algorithm for locating faults in a combined overhead transmission line with underground power cable using Adaptive Network-Based Fuzzy Inference System (ANFIS). The proposed method uses 10 ANFIS networks and consists of 3 stages, including fault type classification, faulty section detection and exact fault location. In the first part, an ANFIS is used to determine the fault type, applying four inputs, i.e., fundamental component of three phase currents and zero sequence current. Another ANFIS network is used to detect the faulty section, whether the fault is on the overhead line or on the underground cable. Other eight ANFIS networks are utilized to pinpoint the faults (two for each fault type). Four inputs, i.e., the dc component of the current, fundamental frequency of the voltage and current and the angle between them, are used to train the neuro-fuzzy inference systems in order to accurately locate the faults on each part of the combined line. The proposed method is evaluated under different fault conditions such as different fault locations, different fault inception angles and different fault resistances. Simulation results confirm that the proposed method can be used as an efficient means for accurate fault location on the combined transmission lines. (author)
Recurrent movements on the northeast-trending Reelfoot rift and west-trending Rough Creek fault zone dominated southeastern Illinois tectonic history. Early Cambrian rifting along both zones created deep trenches that began to fill with sediments. Intermittent movements continued, but faults were quiescent by the Mississippian. Then renewed extension on the Reelfoot rift in the Early Permian produced high-angle normal faults in the Wabash Valley fault system and fluorspar area fault complex, and the right-lateral Cottage Grove fault system. Igneous intrusions accompanied this action: upwelling magma formed Omaha dome; Hicks dome and associated concentric and radial faults appear to have been formed by explosive igneous activity. After the Early Permian, recurrent up-and-down movements of several thousand feet reactivated the fluorspar area fault complex and created the present day Rough Creek and Shawneetown fault zones. Blocks bordering faults returned roughly to their original positions by the Late Cretaceous, leaving narrow slices of rock upthrown and downthrown along faults. Faults in Illinois probably have been inactive since the Cretaceous Period, although the Reelfoot rift south of Cairo has been reactivated. Earthquakes in Illinois today apparently are caused by local east-west horizontal compressional stresses not related to known bedrock faults.
A quality monitoring system can detect certain system faults and fraud attempts in a distributed voting system. The system uses decoy voters to cast predetermined check ballots. Absent check ballots can indicate system faults. Altered check ballots can indicate attempts at counterfeiting votes. The system can also cast check ballots at predetermined times to provide another check on the distributed voting system.
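The check-ballot idea above amounts to comparing what the decoy voters cast against what the voting system recorded. The following is a minimal Python sketch under stated assumptions: the function `audit_check_ballots`, the ballot identifiers, and the dictionary layout are hypothetical and do not describe the actual system.

```python
from typing import Dict, List


def audit_check_ballots(expected: Dict[str, str], recorded: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare predetermined check ballots with the ballots the system recorded.

    expected maps a check-ballot id to the predetermined choice;
    recorded maps ballot ids to the choice stored by the voting system.
    """
    # A check ballot that never arrived points to a possible system fault.
    missing = sorted(bid for bid in expected if bid not in recorded)
    # A check ballot that arrived with a different choice points to possible tampering.
    altered = sorted(
        bid for bid, choice in expected.items()
        if bid in recorded and recorded[bid] != choice
    )
    return {"missing": missing, "altered": altered}


# Hypothetical data: one decoy ballot lost, one changed.
expected = {"chk-1": "A", "chk-2": "B", "chk-3": "A"}
recorded = {"chk-1": "A", "chk-3": "B", "voter-9": "A"}
print(audit_check_ballots(expected, recorded))
```

Ordinary ballots in `recorded` are ignored on purpose, since only the decoy ballots have a known expected value.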
As the number of processors for multi-teraflop systems grows to tens of thousands, with proposed petaflops systems likely to contain hundreds of thousands of processors, the assumption of fully reliable hardware has been abandoned. Although the mean time between failures for the individual components can be very high, the large total component count will inevitably lead to frequent failures. It is therefore of paramount importance to develop new software solutions to deal with the unavoidable reality of hardware faults. In this paper we will first describe the nature of the failures of current large-scale machines, and extrapolate these results to future machines. Based on this preliminary analysis we will present a new technology that we are currently developing, buffered coscheduling, which seeks to implement fault tolerance at the operating system level. Major design goals include dynamic reallocation of resources to allow continuing execution in the presence of hardware failures, very high scalability, high efficiency (low overhead), and transparency (requiring no changes to user applications). Preliminary results show that this is attainable with current hardware.
This paper presents a challenging industrial benchmark for implementation of control strategies under realistic working conditions. The developed control strategies should perform in a plug & play manner, i.e. adapt to varying working conditions, optimize their performance, and provide fault tolerance. A fault-tolerant strategy is needed to deal with a faulty sensor measurement of the evaporation pressure. The design and algorithmic challenges in the control of an evaporator include: unknown model parameters, large parameter variations, varying loads, and external discrete phenomena such as compressor switch on/off or abrupt change in compressor speed.
In quantum computing, where algorithms exist that can solve computational problems more efficiently than any known classical algorithms, the elimination of errors that result from external disturbances or from imperfect gates has become the "holy grail," and a worldwide quest for a large-scale, fault-tolerant, and computationally superior quantum computer is currently taking place. Optimists rely on the premise that, under a certain threshold of errors, an arbitrarily long fault-tolerant quantum computation can be achieved with only moderate (i.e., at most polynomial) overhead in compu
This paper proposes and analyzes some new VLSI architectures for improved fault tolerance. The architectures include structures with two planar layers of processing elements as well as extended cubic designs. The analyses for arrays with various redundancy levels show remarkable improvement in both array yield and processor use over those exhibited by conventional 2-D structures. The improvement can be attributed to the benefits of the third dimension in increasing the flexibility of spares allocation. The architectures can readily substitute for arrays based on mesh or four nearest-neighbor interconnections. From the fault-tolerance viewpoint the cubic structures offer no appreciable performance improvement over the simpler 2-layer structures.
Reversible logic design has become one of the promising research directions in low-power circuit design in the past few years and has found application in low-power CMOS design, digital signal processing and nanotechnology. This paper presents efficient design approaches for fault-tolerant carry skip adders (FTCSAs) and compares those designs with the existing ones. Variable block carry skip logic (VBCSL) using fault-tolerant full adders (FTFAs) has also been developed. The designs are minimized in terms of hardware complexity, gate count, constant inputs and garbage outputs. In addition, a technology-independent evaluation of the proposed designs clearly demonstrates their superiority over the existing counterparts.
The San Andreas fault system, a complex of faults that display predominantly large-scale strike slip, is part of an even more complex system of faults, isolated segments of the East Pacific Rise, and scraps of plates lying east of the East Pacific Rise that collectively separate the North American plate from the Pacific plate. This chapter briefly describes the San Andreas fault system, its setting along the Pacific Ocean margin of North America, its extent, and the patterns of faulting. Only selected characteristics are described, and many features are left for depictions on maps and figures.
This paper is concerned with fault detection for a class of complex networked control systems. By introducing a transmission matrix, a nonlinear delayed Markovian jump system model with partially unknown transition probabilities is established within a multiple-channel data transmission framework. Based on the obtained model, mode-dependent fault detection filters are used as the residual generator, and the fault detection problem is converted into a nonlinear H∞ attenuation problem. The desired mode-dependent fault detection filters are then constructed in terms of linear matrix inequalities such that the fault detection system is stochastically stable with a prescribed H∞ attenuation level. A numerical example is given to demonstrate the effectiveness of the proposed design approach.
Many studies have been carried out on the Durance fault system. One thesis showed the possibility of a surface fault associated with an earthquake near Manosque, at a place where the geological maps indicated no fault. This discovery motivated the development of complementary studies at the IPSN. These studies aim to use data and new analysis systems to refine the geometry of the fault system. The document presents the four approaches used, the analysis of the results, and an example of application: the siting of a school in a fault region. (A.L.B.)
In this paper, a model-based fault diagnosis methodology for PEM fuel cell systems is presented. The methodology is based on computing residuals using an LPV observer. Sensor fault detection faces the problem of robustness using adaptive thresholds generated with interval observer. Fault isolation i...
An assessment on the limitations of communication with MER rovers and how such constraints drove the system design, flight software and fault protection architecture, blurring the line between traditional fault protection and expected nominal behavior, and requiring the most novel autonomous and semi-autonomous elements of the vehicle software including communication, surface mobility, attitude knowledge acquisition, fault protection, and the activity arbitration service.
The San Andreas fault is marked in the landscape by a series of linear valleys and mountain fronts, aligned lakes and bays, elongate ridges, and disrupted or offset stream channels. This chapter describes regional features, local geomorphic features within the fault zone, and gives detailed maps of the fault system.
Models of square and circular tunnels with short faults cutting through their surfaces are investigated by photoelasticity. These models, when duplicated by finite element analysis can predict the stress states of square or circular faulted tunnels adequately. Finite element analysis, using gap elements, may be used to investigate full size faulted tunnel system.
Normal fault systems are basic features in commercially important geological structures such as sedimentary basins. While the interpretation of seismic data sets reveals the structure of the strata and its offset by larger faults, the properties of the fault planes themselves remain undetermined. The ...
Normal faults have traditionally been mapped as long, continuous features with throw decreasing steadily from a maximum near the center to zero at the tips. More recent work has suggested a different view, in which normal faults of all scales are composed of arrays of shorter segments, with relay zones shifting displacement between fault segments that overstep in map view. In some instances segments are linked across the relay zone, whereas in other cases the faults are unconnected, leaving a "gap" in the middle of the fault system. Identification and correct mapping of relay zones is essential in any area where faults play a role in field segmentation. Potential relay zones can be identified by noting strike bends in the fault system, fault systems that are too long for their throw, displacement anomalies, and areas where the fault system splays into two or more strands. The most common error in fault mapping is to interpret faults as longer and more continuous than they really are. This leads to overestimation of fault-dependent trap size and incorrect mapping of production compartments. An inaccurate compartment map can in turn lead to a less effective field development strategy, increasing the likelihood of dry holes or unnecessary wells.
Nuclear plants of the 21st century will employ higher levels of automation and fault tolerance to increase availability, reduce accident risk, and lower operating costs. Key developments in control algorithms, fault diagnostics, fault tolerance, and communication in a distributed system are needed to implement the fully automated plant. Equally challenging will be integrating developments in separate information and control fields into a cohesive system, which collectively achieves the overall goals of improved performance, safety, reliability, maintainability, and cost-effectiveness. Under the Nuclear Energy Research Initiative (NERI), the U.S. Department of Energy is sponsoring a project to address some of the technical issues involved in meeting the long-range goal of 21st century reactor control systems. This project, "A New Paradigm for Automated Development Of Highly Reliable Control Architectures For Future Nuclear Plants," involves researchers from Oak Ridge National Laboratory, University of Tennessee, and North Carolina State University. This paper documents a research effort to develop methods for automated generation of control systems that can be traced directly to the design requirements. Our final goal is to allow the designer to specify only high-level requirements and stress factors that the control system must survive (e.g. a list of transients, or a requirement to withstand a single failure). To this end, the "control engine" automatically selects and validates control algorithms and parameters that are optimized to the current state of the plant, and that have been tested under the prescribed stress factors. The control engine then automatically generates the control software from validated algorithms. Examples of stress factors that the control system must "survive" are: transient events (e.g., set-point changes, or expected occurrences such as a load rejection) and postulated component failures.
These stress factors are specified by the designer and become a database of prescribed transients and component failures. The candidate control systems are tested, and their parameters optimized, for each of these stresses. Examples of high-level requirements are: response time less than xx seconds, or overshoot less than xx%, etc. In mathematical terms, these types of requirements are defined as "constraints," and there are standard mathematical methods to minimize an objective function subject to constraints. Since, in principle, any control design that satisfies all the above constraints is acceptable, the designer must also select an objective function that describes the "goodness" of the control design. Examples of objective functions are: minimize the number or amount of control motions, minimize an energy balance, etc.
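The selection logic described above (keep only designs that meet the constraints, then minimize an objective) can be sketched in a few lines. This is a hedged Python illustration, not the project's control engine: the `Candidate` fields, the function `select_design`, and all numeric limits in the usage lines are hypothetical, and the limits are left as parameters because the abstract does not give them.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """A tested candidate control design (hypothetical fields)."""
    name: str
    response_time_s: float  # measured response time under the prescribed stresses
    overshoot_pct: float    # measured overshoot under the prescribed stresses
    control_motions: int    # objective: number of control motions (lower is better)


def select_design(
    candidates: Sequence[Candidate],
    max_response_time_s: float,
    max_overshoot_pct: float,
) -> Optional[Candidate]:
    """Return the feasible candidate with the fewest control motions, or None."""
    # Constraints: every requirement must hold for a design to be acceptable.
    feasible = [
        c for c in candidates
        if c.response_time_s <= max_response_time_s and c.overshoot_pct <= max_overshoot_pct
    ]
    # Objective: among acceptable designs, minimize the number of control motions.
    return min(feasible, key=lambda c: c.control_motions, default=None)


# Illustrative candidates and limits only.
designs = [
    Candidate("pid-a", response_time_s=8.0, overshoot_pct=4.0, control_motions=30),
    Candidate("pid-b", response_time_s=5.0, overshoot_pct=12.0, control_motions=10),
    Candidate("mpc-a", response_time_s=6.0, overshoot_pct=3.0, control_motions=22),
]
best = select_design(designs, max_response_time_s=10.0, max_overshoot_pct=5.0)
print(best.name if best else "no feasible design")
```

A brute-force filter is used here for clarity; a real design search over continuous parameters would more likely rely on a constrained optimizer.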
The dilatational strains associated with vertical faults embedded in a horizontal plate are examined in the framework of fault kinematics and simple displacement boundary conditions. Using boundary element methods, a sequence of examples of dilatational strain fields associated with commonly occurring strike-slip fault zone features (bends, offsets, finite rupture lengths, and nonuniform slip distributions) is derived. The combinations of these strain fields are then used to examine the Parkfield region of the San Andreas fault system in central California.
One of the key components of a Fourier Transform Infrared Spectrometer (FTIR) is the linear translation stage used to vary the optical path length between the two arms of the interferometer. This translation mechanism must produce extremely constant velocity motion across its entire range of travel to allow the instrument to attain high signal-to-noise ratio and spectral resolving power. A new spectrometer is being developed at the Jet Propulsion Laboratory under NASA's Planetary Instrument Definition and Development Program (PIDDP). The goal of this project is to build upon existing spaceborne FTIR spectrometer technology to produce a new instrument prototype that has drastically superior spectral resolution and substantially lower mass, making it feasible for planetary exploration. In order to achieve these goals, Alliance Spacesystems, Inc. (ASI) has developed a linear translation mechanism using a novel ultrasonic piezo linear motor in conjunction with a fully kinematic, fault-tolerant linear rail system. The piezo motor provides extremely smooth motion, is inherently redundant, and is capable of producing unlimited travel. The kinematic rail uses spherical Vespel(R) rollers and bushings, which eliminates the need for wet lubrication, while providing a fault-tolerant platform for smooth linear motion that will not bind under misalignment or structural deformation. This system can produce velocities from 10 to 100 mm/s with less than 1% velocity error over the entire 100-mm length of travel for a total mechanism mass of less than 850 grams. This system has performed over half a million strokes under vacuum without excessive wear or degradation in performance. This paper covers the design, development, and testing of this linear translation mechanism as part of the Planetary Atmosphere Occultation Spectrometer (PAOS) instrument prototype development program.
Reliability of navigation data is critical for steering and manoeuvring control, particularly at high speed or in critical phases of a mission. Should faults occur, faulty instruments need to be autonomously isolated and faulty information discarded. This paper designs a navigation solution where essential navigation information is provided even with multiple faults in instrumentation. The paper proposes a provably correct implementation through auto-generated state-event logics in a supervisory part of the algorithms. Test results from naval vessels document the performance and show events where the fault-tolerant sensor fusion provided uninterrupted navigation data despite temporary instrument defects.
Constructing a fault-tolerant quantum computer is a daunting task. Given any design, it is possible to determine the maximum error rate of each type of component that can be tolerated while still permitting arbitrarily large-scale quantum computation. It is an underappreciated fact that including an appropriately designed mechanism enabling long-range qubit coupling or transport substantially increases the maximum tolerable error rates of all components. With this thought in mind, we take the superconducting flux qubit coupling mechanism described in PRB 70, 140501 (2004) and extend it to allow approximately 500 MHz coupling of square flux qubits, 50 um a side, at a distance of up to several mm. This mechanism is then used as the basis of two scalable architectures for flux qubits taking into account crosstalk, incorporation of classical control circuitry, power dissipation, and fault-tolerant considerations such as permitting a universal set of logical gates, parallelism, measurement and initialization, and ...
In this paper, an actuator fault-tolerant control (FTC) strategy based on set separation is presented. The proposed scheme employs a standard configuration consisting of a bank of observers which match the different fault situations that can occur in the plant. Each of these observers has an associated estimation error with a distinctive behaviour when an estimator matches the current fault situation of the plant. With this information from each observer, a fault diagnosis and isolation (FDI) module is able to reconfigure the control loop by selecting the appropriate stabilising controller from a bank of precomputed control laws, each of them related to one of the considered fault models. The control law consists of a reference feedforward term and a feedback gain multiplying the s...
We have used solid-state nuclear track detectors (CR-39) to determine the soil radon profile in areas of the North and East Anatolian active fault systems in Turkey. The radon anomalies in the fault zones are relatively high on the fault line and decrease dramatically with distance from it. Radon concentrations in both active fault systems ranged from 4.3 to 9.8 kBq m^-3. The average radon concentration levels in the North Anatolian Fault System are relatively higher than in the East Anatolian Fault System. The radon measurement technique proves to be a good tool for detecting and mapping active fault zones, and for continuous monitoring of radon anomalies connected with earthquake events.
Reflection survey is conducted using three traverse lines in Sakai City for confirming the presence of a southward stretch of the Uemachi fault underground along the western periphery of the Uemachi terrace, Osaka, and for elucidating its connection to the Sakamoto fault distributed near Iz