G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. Submission to the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII).
.... Computing Researchers in the area of recovery oriented computing have developed a variety of techniques to help software recover from runtime errors [14]. One of these techniques, recursive restartability, composes large systems out of many smaller modules that are individually rebootable [4]. The goal is to build systems in which faults can be isolated at the module level by rebooting. In some cases, the consequences of an error may not be immediately apparent and the system may run ahead, generating an unacceptable execution. In such cases, the ability to undo an application's ....
G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), pages 110-115, Schloss Elmau, Germany, May 2001.
.... dynamic extensibility has long been promoted as a way to manage the complexity, and improve maintainability and reliability of operating systems [2-5]. Recently, the low reliability of some system components, particularly device drivers, has triggered renewed efforts to isolate such components [6, 7]. The common problem here is the need to isolate untrusted (buggy or potentially malicious) code. In addition, component technology [8-10], which is an attractive way of constructing extensions, is leading to a reduced granularity of the units of code and data that require protection or ....
George Candea and Armando Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proc. 8th HotOS, pages 125--130, 2001.
....introspection, externalize the execution state, and provide components that can be replaced to repair the state at the application level. We propose similar techniques to be applied for repairing state while executing, and retrieving useful state after failure in a complete computer system. In [5], an all or nothing approach is taken and the system is rebooted to bring it back to a consistent working state following a failure. Our approach addresses a missing link: we reuse useful state in the system memory through repair and recovery rather than lose it by rebooting. Self adapting ....
G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. In Proc. HotOS-VIII, May 2001.
....logging requires the service to block till the log is written to stable storage. Optimistic logging allows the log message to be flushed asynchronously to stable storage while the process goes ahead with other operations. The concept of recursive restartability has recently gained popularity [5], in which a fault tolerant system allows restarting components at multiple levels depending upon severity of failure. In a similar spirit, Duplex API permits ....
G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), pages 125--132, May 2001.
....and performance degradation still involve operators and developers in the feedback loop. After analyzing and forming a hypothesis of the system's behavior, we can close some of the feedback loops by providing a trigger for dynamic adaptation techniques. For software bugs, recursive restarts [6] bring the system back to a known, functioning state. For configuration errors, undo [5] helps the system configuration roll back to a previous working configuration. For an overloaded system, dynamic connection management [8] allows it to degrade gracefully by performing admission control or by ....
G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. In HotOS-VIII, 2001.
....70 days [23] which is far longer than our observed average 14 day reboot interval. Other factors leading to rejuvenation being effective for operating systems require much greater mean uptimes. Finally, recent work has proposed extending the idea of rejuvenation throughout all layers of software [5]. While all these works will improve software robustness (and will likely mask software failures that do not lead to node reboot) our characterization shows that they are unlikely to improve the masking of node failures as the entire workstation node increases in availability with time. The ....
G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. In 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), May 2001.
....in today's Internet services world, where there are so many people in the gold rush to get their applications online first, according to a division head of Sun's consulting wing. Major internet portals are deploying code written by gumshoe engineers with little more than a week of job experience [8]. In the words of Debra Chrapraty, former CIO of E Trade, a major online brokerage service, "We used to have six months of development on a product and three months of testing. We don't live that way any more. In Internet time, people get sloppy" [41]. In summary, blind adherence to this ....
....be reloaded upon restart. This design acknowledges that application code can be buggy and provides fast restart and recovery mechanisms. More recent work has attempted to formalize the properties of such restartable systems and to devise the most appropriate ways to perform restart based recovery [8] [28]. Furthermore, anecdotal reports suggest that these design techniques are used by production Internet services [8]. Our proposed work rests on the same philosophical underpinnings as this previous work, but goes beyond it in two ways. First, we include a focus on the sorely neglected problem ....
[Article contains additional citation context not shown here]
G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. Submission
....evolution makes it possible to automate many parts of the feedback loops with confidence. After analyzing and forming a hypothesis of the system's behavior, we can close some of the feedback loops by providing a trigger for dynamic adaptation techniques. For software bugs, recursive restarts [6] bring the system back to a known, functioning state. For configuration errors, undo [5] helps the system configuration roll back to a previous working configuration. For an overloaded system, dynamic connection management [8] allows it to degrade gracefully by performing admission control or by ....
G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. In HotOS-VIII, 2001.
.... changes, so does software, and thus the traditional high-availability design techniques of careful software engineering and extensive software testing go out the window [6]. Major internet portals are deploying code written by gumshoe engineers with little more than a week of job experience [2]. In the words of Debra Chrapraty, former CIO of E Trade, a major online brokerage service, "We used to have six months of development on a product and three months of testing. We don't live that way any more. In Internet time, people get sloppy" [10]. When people are sloppy, software bugs ....
....failures. We call this philosophy Recovery Oriented Computing (ROC). At its heart, ROC addresses hardware, software, and human failures by providing rapid and effective mechanisms for detecting and recovering from them. Recovery can take many forms, from simple mechanisms like design for reboot [2], to more complex schemes such as fail stop fault containment combined with data redundancy, to full regeneration of system state from backups, checkpoints, and logs. A full discussion is outside the scope of this paper. But in all cases, these mechanisms should be designed to make as few ....
G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. Submission to the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII).
No context found.
G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proc. 8th Workshop on Hot Topics in Operating Systems, Elmau, Germany, 2001.
No context found.
G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proc. 8th Workshop on Hot Topics in Operating Systems, Elmau/Oberbayern, Germany, 2001.
No context found.
G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. In HotOS VIII, 2001.
No context found.
G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. In HotOS VIII, 2001.
....failures. To this end, ROC researchers and others have been investigating techniques for failure detection and recovery that are external to the application and do not rely on a priori fault models or models of the application's semantics; examples include recursive restarts as a form of recovery [4] and anomaly detection in runtime path analysis as a form of failure detection [7]. Given the emergence of popular middleware platforms for Internet Web based applications, such as Java 2 Enterprise Edition (J2EE), we believe these techniques can be applied today to the middleware platform ....
....(RM) is an entity external to the application server. It attempts automated recovery and only involves system administrators when automated recovery is unsuccessful. We rely on micro reboots for recovery; micro rebooting is a good way to recover from most transient failures in Internet systems [4]. The recovery manager listens on a UDP port for failure notifications from the monitors. Using failure information, it builds up a representation of the failure propagation paths through the system in the form of a graph, whose structure is described in section 3.1. After updating the graph, the ....
G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proc. 8th Workshop on Hot Topics in Operating Systems, Elmau/Oberbayern, Germany, 2001.
....station that integrates COTS components. A current goal in the design and deployment of Mercury is to improve ground station availability, as it was not originally designed with high availability in mind. Our first step in improving the availability of Mercury was to apply recursive reboots [19] to cure transient failures by restarting suitably chosen subsystems, such that overall mean time to recover (MTTR) is minimized. We had two main goals in applying RR to Mercury. The first was to partially remove the human from the loop in ground station control by automating recovery from ....
....among its logical sub components, rearchitecting along the MTTR/MTTF separation lines may often turn out to be the optimal engineering choice. Balancing MTTR/MTTF characteristics in every component is a step toward building a more robust and highly available system. As explained in [19], RR attempts to exploit strong existing fault isolation boundaries, such as virtual memory, physical node separation, or kernel process control, leading to higher confidence that a sequence of restarts will effectively cure transients. To preserve this property, recovery group boundaries should ....
[Article contains additional citation context not shown here]
G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proc. 8th Workshop on Hot Topics in Operating Systems, Elmau/Oberbayern, Germany, 2001.
....are probably not newsworthy because they affect far fewer users. The difference, of course, is that in the latter case the same availability is achieved by having a much shorter MTTR. Frequent recovery may lengthen effective MTTF. Software rejuvenation [Garg97] and recursive restartability [Candea01] both exploit the observation that by returning a system periodically to its start state (typically a well understood and heavily tested state) we can reclaim stale resources, clean up corrupted state and other side effects of software aging, and eliminate the corresponding side effects (e.g. ....
George Candea and Armando Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. Proc. Eighth Workshop on Hot Topics in Operating Systems (HotOS-VIII), Elmau, Germany, May 2001.
....that integrates COTS components. A current goal in the design and deployment of Mercury is to improve ground station availability, as it was not originally designed with high availability in mind. Our first step in improving the availability of Mercury was to apply recursive restartability [4], an approach to system recovery that advocates curing transient failures by restarting suitably chosen subsystems, such that overall mean time to recover (MTTR) is minimized. Recursive restartability is a concrete example of the recovery oriented computing (ROC) philosophy [12] as applied to ....
....characteristics among its logical sub components, rearchitecting along the MTTR/MTTF separation lines may often turn out to be the optimal engineering choice. Balancing MTTF characteristics in every component is a step toward building a more robust and highly available system. As explained in [4], RR attempts to exploit strong existing fault isolation boundaries, such as virtual memory, physical node separation, or kernel process control, leading to higher confidence that a sequence of restarts will effectively cure transients. To preserve this property, restart group boundaries should not ....
[Article contains additional citation context not shown here]
G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems, pages 110--115, Elmau, Germany, May 2001.
....the control software by a factor of almost six. Besides being a significant quantitative improvement, it also constituted a qualitative improvement that led to nearly 100% availability of the ground station during the critical period when the satellite passes overhead. Recursive restartability [Candea01a] is an approach to system recovery that assumes that, in critical infrastructures, most bugs cause software to crash, deadlock, spin, leak memory, or otherwise fail in a way that leaves reboot or restart as the only means of restoring the system [Brewer01, Gray78]. Reboots are an effective and ....
Candea, G.; Fox, A. Recursive restartability: turning the reboot sledgehammer into a scalpel. Proc. 8th Workshop on Hot Topics in Operating Systems, 2001. p.125-30.
....that integrates COTS components. A current goal in the design and deployment of Mercury is to improve ground station availability, as it was not originally designed with high availability in mind. Our first step in improving the availability of Mercury was to apply recursive restartability [4], an approach to system recovery that advocates curing transient failures by restarting suitably chosen subsystems, such that overall mean time to recover (MTTR) is minimized. Recursive restartability is a concrete example of the recovery oriented computing (ROC) philosophy [12] as applied to ....
....characteristics among its logical sub components, rearchitecting along the MTTR/MTTF separation lines may often turn out to be the optimal engineering choice. Balancing MTTR/MTTF characteristics in every component is a step toward building a more robust and highly available system. As explained in [4], RR attempts to exploit strong existing fault isolation boundaries, such as virtual memory, physical node separation, or kernel process control, leading to higher confidence that a sequence of restarts will effectively cure transients. To preserve this property, restart group boundaries should not ....
[Article contains additional citation context not shown here]
G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems, pages 110--115, Elmau, Germany, May 2001.
....that reacts rapidly to subsystem unavailability, to decrease MTTR. Proactive rejuvenation of software components, to avert failures related to poor resource management, and thus increase MTBF. In this section we will summarize the recursive restartability concept and direct the reader to [2] for details. To define a recursively restartable (RR) system, we take both a functional and a constructional approach. From a functional point of view, an RR system is one in which collections of components/subsystems can be graciously restarted with little or no advance warning. One possible way ....
G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Workshop on Hot Topics in Operating Systems, Elmau, Germany, 2001.
No context found.
G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. Submission to the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII).
No context found.
G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the Eighth IEEE Workshop on Hot Topics in Operating Systems, May 2001.
No context found.
Candea, G. and Fox, A. 2001. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the Eighth IEEE HOTOS. 125--132.
No context found.
G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In 8th Workshop on Hot Topics in Operating Systems, pages 125--132, 2001.
No context found.
George Candea and Armando Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proc. 8th HotOS, pages 125--130, 2001.
No context found.
George Candea and Armando Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In 8th HotOS, pages 125--130, 2001.
No context found.
George Candea and Armando Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proc. 8th HotOS, pages 125--130, 2001.
No context found.
G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), pages 110--115, Schloss Elmau, Germany, May 2001.
No context found.
G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), Schloss Elmau, Germany, May 2001.
No context found.
G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), pages 110--115, Schloss Elmau, Germany, May 2001. IEEE Computer Society.
No context found.
George Candea and Armando Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems, pages 125--130, Elmau, Germany, May 2001. IEEE Computer Society Press.
No context found.
George Candea and Armando Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems, pages 125--130, Elmau, Germany, May 2001. IEEE Computer Society Press.
No context found.
George Candea and Armando Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In HotOS-VIII, 2001.
No context found.
G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), pages 110--115, Schloss Elmau, Germany, May 2001.
No context found.
G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the Eighth IEEE HOTOS, pages 125--132, May 2001.
No context found.
G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. In Proc. HotOS-VIII, May 2001.
No context found.
G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems, pages 125--130, Elmau, Germany, May 2001.
No context found.
G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), pages 110--115, Schloss Elmau, Germany, May 2001.
No context found.
George Candea and Armando Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), pages 110--115, Schloss Elmau, Germany, May 2001.
No context found.
George Candea and Armando Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS), pages 125--130, 2001.
No context found.
G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), pages 110-115, Schloss Elmau, Germany, May 2001.
No context found.
G. Candea and A. Fox, Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. HotOS-VIII, 2001.
No context found.
G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), May 2001.
No context found.
G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), pages 125--132, May 2001.
No context found.
G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), 2001.