Jim Gray and D. Siewiorek. High-availability computer systems. IEEE Computer, 24(9), 1991. Includes a criticism of N-version programming.
....But these applications have a high rate of failure and faults in them are directly visible to the common man or woman. Unfortunately, we have not yet succeeded in building fault free computer systems. Computers fail due to a variety of problems with their hardware and software. Field studies [Gray91] and everyday experience show that the dominant cause of failures today is software faults, both in the application and system layers. Reducing the number of software faults and surviving the ones that remain has been an important challenge for the fault tolerance community. Researchers have ....
....in the kernel text and modifies the instruction to reflect the kind of fault we intend to inject. 9 CHAPTER 2 SOFTWARE FAULT RECOVERY AND ASSUMPTIONS 2.1 Introduction As computers become an integral part of today's society, making them dependable becomes increasingly important. Field studies [Gray91] and everyday experience make it clear that the dominant cause of failures today is software faults, both in the application and system layers. Reducing the number of failures caused by software faults is therefore an important challenge for the fault tolerance community. The best way to ensure ....
[Article contains additional citation context not shown here]
Jim Gray and Daniel P. Siewiorek. High-Availability Computer Systems. IEEE Computer, 24(9):39--48, September 1991.
....even with an operator with an around the clock pager that can be called in even during the middle of the night, cannot be expected to recover in a few minutes, let alone a few seconds like our design. It usually takes much longer for human operators to repair problems than machines do [GS91]. 15 2.3 Our Approach I am proposing a self maintaining approach to system administration, in which the storage system maintains itself with only minimal help from a human operator. Instead of having someone constantly on call look after the system, our system is designed to mask problems and ....
....is important to reduce the system's down time as seen from across the Internet. Note that a system that relies on an operator to keep it running is not as available as one that maintains itself, as it will take minutes or maybe even hours for the operator to actually be able to repair the damage [GS91]. Our goal is to have the system repair any interruption of service within a few seconds, and continue to function unattended until the next scheduled visit by the operator. For a web server application such as the one we are running, this is illustrated by the slogan "repair by reload". As Mary ....
Jim Gray and Daniel P. Siewiorek. High-availability computer systems. IEEE Computer, 24(9), September 1991.
....RAID organization that accommodates multiple failures within reliability groups while retaining its excellent storage utilization, response time, and fault recovery properties. These schemes constitute a step towards the high availability computer systems recently advocated by Gray and Siewiorek [5]. We are concerned with exploring the performance improvements that are available within very large disk arrays. We will consider workloads featuring operations that involve a small quantity of data typical of database transactions. We are interested in considering the run time effects of various ....
Jim Gray and Daniel P. Siewiorek. "High-Availability Computer Systems," COMPUTER, pp. 39-48, 1991.
....issues related to, reliable DSM systems. 2.3.1 Terminology Fault tolerance discussions benefit from terminology and concepts developed by the International Federation for Information Processing Working Group 10.4 and by the IEEE Computer Society Technical Committee on Fault Tolerant Computing [26]. We may view a system as consisting of multiple modules, which are in turn composed of sub modules. A module has an ideal specified behavior and an observed actual behavior. A failure is deviation of the actual behavior from the specified behavior. A failure is caused by an error, which is a ....
Jim Gray and Daniel P. Siewiorek. High-availability computer systems. IEEE Computer, pages 39--48, September 1991.
....these channels independent, an enterprise root can validate received certificates. Users may also opt to validate certificates through several independent channels, resulting in increased confidence in their authenticity. This approach is similar to mechanisms for high availability proposed in [GS91]. The architecture must be flexible. The topology of the Internet is constantly changing, so the architecture and underlying protocols must not be dependent on the physical connectivity or location of any singular authority. Mobility of users is of equal importance. As users travel from one domain ....
Jim Gray and Daniel P. Siewiorek. High-Availability Computer Systems. IEEE Computer, 24(9):39--48, September 1991.
....support components. An aggregate MTTDL of a million hours (114 years) translates into only a 2.6% likelihood of any data loss at all during a typical 3 year array lifetime. This is much lower than the rate of problems due to software failures, operator errors, and other environmental difficulties [Gray90, Gray91a]; that is, a small to medium sized array that achieves an overall MTTDL of 1M hours or better will probably be entirely adequate for the majority of its applications. In addition to reduced failure rates, modern disks also provide feedback mechanisms for predicting when such failures will occur. ....
Jim Gray and Daniel P. Siewiorek. High-availability computer systems. IEEE Computer, 24(9):39--48, September 1991.
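The MTTDL figures quoted in the excerpt above (a million hours, roughly 114 years, and a data-loss likelihood of about 2.6 percent over a 3 year lifetime) can be checked with a short calculation. The sketch below assumes a constant failure rate, i.e. an exponential time-to-data-loss model; the excerpt does not state which model its authors used, and the function and variable names are hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760; leap days ignored (assumption)


def data_loss_probability(mttdl_hours: float, lifetime_hours: float) -> float:
    """Probability of at least one data-loss event within the lifetime.

    Assumes an exponential (constant-rate) model, which is an assumption
    of this sketch and is not stated in the quoted excerpt.
    """
    return 1.0 - math.exp(-lifetime_hours / mttdl_hours)


mttdl_hours = 1_000_000                      # "a million hours" from the excerpt
mttdl_years = mttdl_hours / HOURS_PER_YEAR   # about 114 years
lifetime_hours = 3 * HOURS_PER_YEAR          # "typical 3 year array lifetime"

p_loss = data_loss_probability(mttdl_hours, lifetime_hours)
print(f"MTTDL in years: {mttdl_years:.0f}")
print(f"Chance of data loss over 3 years: {p_loss:.1%}")
```

Under this assumption the result agrees with the quoted numbers to the precision given; a different lifetime model would shift the percentage slightly.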
....it is important to reduce the system's down time as seen from across the Internet. Note that a system that relies on an operator to keep it running is not as available as one that maintains itself, as it will take minutes or maybe even hours for the operator to actually be able to repair the damage [22]. Our goal is to have the system repair any interruption of service within a few seconds, and continue to function unattended until the next scheduled visit by the operator. For a web server application such as the one we are running, this is illustrated by the slogan "repair by reload". As Mary ....
Jim Gray and Daniel P. Siewiorek. High-availability computer systems. IEEE Computer, 24(9), September 1991.
No context found.
Jim Gray. High Availability Computer Systems. IEEE Computer, Sept. 1991.