45 citations found.
G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. Submission to the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII).


This paper is cited in the following contexts:
Automatic Data Structure Repair for Self-Healing Systems - Brian Demsky Massachusetts (2003)   (2 citations)  (Correct)

.... Computing Researchers in the area of recovery-oriented computing have developed a variety of techniques to help software recover from runtime errors [14]. One of these techniques, recursive restartability, composes large systems out of many smaller modules that are individually rebootable [4]. The goal is to build systems in which faults can be isolated at the module level by rebooting. In some cases, the consequences of an error may not be immediately apparent and the system may run ahead, generating an unacceptable execution. In such cases, the ability to undo an application's ....

G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), pages 110-115, Schloss Elmau, Germany, May 2001.


Legba: Fast Hardware Support for Fine-Grained Protection - Wiggins, Winwood, Tuch.. (2003)   (1 citation)  (Correct)

.... dynamic extensibility has long been promoted as a way to manage the complexity, and improve maintainability and reliability of operating systems [2-5]. Recently, the low reliability of some system components, particularly device drivers, has triggered renewed efforts to isolate such components [6, 7]. The common problem here is the need to isolate untrusted (buggy or potentially malicious) code. In addition, component technology [8-10], which is an attractive way of constructing extensions, is leading to a reduced granularity of the units of code and data that require protection or ....

George Candea and Armando Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proc. 8th HotOS, pages 125--130, 2001.


Using Remote Memory Communication for Self-Healing Systems - Sultan, Bohra, Neamtiu..   (Correct)

....introspection, externalize the execution state, and provide components that can be replaced to repair the state at the application level. We propose similar techniques to be applied for repairing state while executing, and retrieving useful state after failure in a complete computer system. In [5], an all or nothing approach is taken and the system is rebooted to bring it back to a consistent working state following a failure. Our approach addresses a missing link: we reuse useful state in the system memory through repair and recovery rather than lose it by rebooting. Self adapting ....

G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. In Proc. HotOS-VIII, May 2001.


Duplex: A Reusable Fault Tolerance Extension.. - Sharma, Chen, Li.. (2003)   (3 citations)  (Correct)

....logging requires the service to block till the log is written to stable storage. Optimistic logging allows the log message to be flushed asynchronously to stable storage while the process goes ahead with other operations. The concept of recursive restartability has recently gained popularity [5], in which a fault tolerant system allows restarting components at multiple levels depending upon severity of failure. In a similar spirit, Duplex API permits ....

G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), pages 125--132, May 2001.


An Online Evolutionary Approach to Developing Internet.. - Chen, Kiciman, Brewer (2002)   (1 citation)  (Correct)

....and performance degradation still involve operators and developers in the feedback loop. After analyzing and forming a hypothesis of the system's behavior, we can close some of the feedback loops by providing a trigger for dynamic adaptation techniques. For software bugs, recursive restarts [6] bring the system back to a known, functioning state. For configuration errors, undo [5] helps the system configuration rollback to a previous working configuration. For an overloaded system, dynamic connection management [8] allows it to degrade gracefully by performing admission control or by ....

G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. In HotOS-VIII, 2001.


Improving Cluster Availability Using Workstation Validation - Heath, Martin, Nguyen (2002)   (2 citations)  (Correct)

....70 days [23], which is far longer than our observed average 14 day reboot interval. Other factors leading to rejuvenation being effective for operating systems require much greater mean uptimes. Finally, recent work has proposed extending the idea of rejuvenation throughout all layers of software [5]. While all these works will improve software robustness (and will likely mask software failures that do not lead to node reboot), our characterization shows that they are unlikely to improve the masking of node failures as the entire workstation node increases in availability with time. The ....

G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. In 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), May 2001.


Accepting Failure: Availability through Repair-centric System Design - Brown (2001)   (Correct)

....in today's Internet services world, where there are so many people in the gold rush to get their applications online first, according to a division head of Sun's consulting wing. Major internet portals are deploying code written by gumshoe engineers with little more than a week of job experience [8]. In the words of Debra Chrapraty, former CIO of E*Trade, a major online brokerage service, "We used to have six months of development on a product and three months of testing. We don't live that way any more. In Internet time, people get sloppy" [41]. In summary, blind adherence to this ....

....be reloaded upon restart. This design acknowledges that application code can be buggy and provides fast restart and recovery mechanisms. More recent work has attempted to formalize the properties of such restartable systems and to devise the most appropriate ways to perform restart-based recovery [8] [28]. Furthermore, anecdotal reports suggest that these design techniques are used by production Internet services [8]. Our proposed work rests on the same philosophical underpinnings as this previous work, but goes beyond it in two ways. First, we include a focus on the sorely neglected problem ....

[Article contains additional citation context not shown here]

G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. Submission


An Online Evolutionary Approach to Developing Internet Services - Chen, Kiciman, Brewer (2002)   (1 citation)  (Correct)

....evolution makes it possible to automate many parts of the feedback loops with confidence. After analyzing and forming a hypothesis of the system's behavior, we can close some of the feedback loops by providing a trigger for dynamic adaptation techniques. For software bugs, recursive restarts [6] bring the system back to a known, functioning state. For configuration errors, undo [5] helps the system configuration rollback to a previous working configuration. For an overloaded system, dynamic connection management [8] allows it to degrade gracefully by performing admission control or by ....

G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. In HotOS-VIII, 2001.


Embracing Failure: A Case for Recovery-Oriented Computing (ROC) - Brown, Patterson (2001)   (8 citations)  (Correct)

.... changes, so does software, and thus the traditional high-availability design techniques of careful software engineering and extensive software testing go out the window [6]. Major internet portals are deploying code written by gumshoe engineers with little more than a week of job experience [2]. In the words of Debra Chrapraty, former CIO of E*Trade, a major online brokerage service, "We used to have six months of development on a product and three months of testing. We don't live that way any more. In Internet time, people get sloppy" [10]. When people are sloppy, software bugs ....

....failures. We call this philosophy Recovery Oriented Computing (ROC). At its heart, ROC addresses hardware, software, and human failures by providing rapid and effective mechanisms for detecting and recovering from them. Recovery can take many forms, from simple mechanisms like design for reboot [2], to more complex schemes such as fail-stop fault containment combined with data redundancy, to full regeneration of system state from backups, checkpoints, and logs. A full discussion is outside the scope of this paper. But in all cases, these mechanisms should be designed to make as few ....

G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. Submission to the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII).


Microreboot - A Technique for Cheap Recovery - Candea, Kawamoto, Fujiki.. (2004)   (4 citations)  Self-citation (Candea Fox)   (Correct)

No context found.

G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proc. 8th Workshop on Hot Topics in Operating Systems, Elmau, Germany, 2001.


Crash-Only Software - Candea, Fox (2003)   (3 citations)  Self-citation (Candea Fox)   (Correct)

No context found.

G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proc. 8th Workshop on Hot Topics in Operating Systems, Elmau/Oberbayern, Germany, 2001.


Using Runtime Paths for Macro Analysis - Mike Chen Emre (2003)   (6 citations)  Self-citation (Fox)   (Correct)

No context found.

G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. In HotOS VIII, 2001.


Using Runtime Paths for Macroanalysis - Mike Chen Emre (2003)   (6 citations)  Self-citation (Fox)   (Correct)

No context found.

G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. In HotOS VIII, 2001.


JAGR: An Autonomous Self-Recovering Application Server - Candea, Kiciman, Zhang.. (2003)   (2 citations)  Self-citation (Candea Fox)   (Correct)

....failures. To this end, ROC researchers and others have been investigating techniques for failure detection and recovery that are external to the application and do not rely on a priori fault models or models of the application's semantics; examples include recursive restarts as a form of recovery [4] and anomaly detection in runtime path analysis as a form of failure detection [7]. Given the emergence of popular middleware platforms for Internet Web-based applications, such as Java 2 Enterprise Edition (J2EE), we believe these techniques can be applied today to the middleware platform ....

....(RM) is an entity external to the application server. It attempts automated recovery and only involves system administrators when automated recovery is unsuccessful. We rely on micro-reboots for recovery; micro-rebooting is a good way to recover from most transient failures in Internet systems [4]. The recovery manager listens on a UDP port for failure notifications from the monitors. Using failure information, it builds up a representation of the failure propagation paths through the system in the form of a graph, whose structure is described in section 3.1. After updating the graph, the ....

G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proc. 8th Workshop on Hot Topics in Operating Systems, Elmau/Oberbayern, Germany, 2001.


Improving Availability with Recursive Micro-Reboots: A.. - Candea, Cutler, Fox (2003)   (1 citation)  Self-citation (Candea Fox)   (Correct)

....station that integrates COTS components. A current goal in the design and deployment of Mercury is to improve ground station availability, as it was not originally designed with high availability in mind. Our first step in improving the availability of Mercury was to apply recursive reboots [19] to cure transient failures by restarting suitably chosen subsystems, such that overall mean time to recover (MTTR) is minimized. We had two main goals in applying RR to Mercury. The first was to partially remove the human from the loop in ground station control by automating recovery from ....

....among its logical subcomponents, rearchitecting along the MTTR/MTTF separation lines may often turn out to be the optimal engineering choice. Balancing MTTR/MTTF characteristics in every component is a step toward building a more robust and highly available system. As explained in [19], RR attempts to exploit strong existing fault isolation boundaries, such as virtual memory, physical node separation, or kernel process control, leading to higher confidence that a sequence of restarts will effectively cure transients. To preserve this property, recovery group boundaries should ....

[Article contains additional citation context not shown here]

G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proc. 8th Workshop on Hot Topics in Operating Systems, Elmau/Oberbayern, Germany, 2001.


Toward Recovery-Oriented Computing - Fox (2002)   (1 citation)  Self-citation (Fox)   (Correct)

....are probably not newsworthy because they affect far fewer users. The difference, of course, is that in the latter case the same availability is achieved by having a much shorter MTTR. Frequent recovery may lengthen effective MTTF. Software rejuvenation [Garg97] and recursive restartability [Candea01] both exploit the observation that by returning a system periodically to its start state (typically a well-understood and heavily tested state), we can reclaim stale resources, clean up corrupted state and other side effects of software aging, and eliminate the corresponding side effects (e.g. ....

George Candea and Armando Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. Proc. Eighth Workshop on Hot Topics in Operating Systems (HotOS-VIII), Elmau, Germany, May 2001.


Reducing Recovery Time in a Small Recursively.. - Candea, Cutler, Fox, .. (2002)   (1 citation)  Self-citation (Candea Fox)   (Correct)

....that integrates COTS components. A current goal in the design and deployment of Mercury is to improve ground station availability, as it was not originally designed with high availability in mind. Our first step in improving the availability of Mercury was to apply recursive restartability [4], an approach to system recovery that advocates curing transient failures by restarting suitably chosen subsystems, such that overall mean time to recover (MTTR) is minimized. Recursive restartability is a concrete example of the recovery-oriented computing (ROC) philosophy [12] as applied to ....

....characteristics among its logical subcomponents, rearchitecting along the MTTR/MTTF separation lines may often turn out to be the optimal engineering choice. Balancing MTTF characteristics in every component is a step toward building a more robust and highly available system. As explained in [4], RR attempts to exploit strong existing fault isolation boundaries, such as virtual memory, physical node separation, or kernel process control, leading to higher confidence that a sequence of restarts will effectively cure transients. To preserve this property, restart group boundaries should not ....

[Article contains additional citation context not shown here]

G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems, pages 110--115, Elmau, Germany, May 2001.


Recovery Oriented Computing (ROC): Motivation.. - Patterson, Brown, .. (2002)   (5 citations)  Self-citation (Candea)   (Correct)

....the control software by a factor of almost six. Besides being a significant quantitative improvement, it also constituted a qualitative improvement that led to nearly 100% availability of the ground station during the critical period when the satellite passes overhead. Recursive restartability [Candea01a] is an approach to system recovery that assumes that, in critical infrastructures, most bugs cause software to crash, deadlock, spin, leak memory, or otherwise fail in a way that leaves reboot or restart as the only means of restoring the system [Brewer01, Gray78]. Reboots are an effective and ....

Candea, G.; Fox, A. Recursive restartability: turning the reboot sledgehammer into a scalpel. Proc. 8th Workshop on Hot Topics in Operating Systems, 2001. p.125-30.


Reducing Recovery Time in a Small Recursively.. - Candea, Cutler, Fox, .. (2002)   (1 citation)  Self-citation (Candea Fox)   (Correct)

....that integrates COTS components. A current goal in the design and deployment of Mercury is to improve ground station availability, as it was not originally designed with high availability in mind. Our first step in improving the availability of Mercury was to apply recursive restartability [4], an approach to system recovery that advocates curing transient failures by restarting suitably chosen subsystems, such that overall mean time to recover (MTTR) is minimized. Recursive restartability is a concrete example of the recovery-oriented computing (ROC) philosophy [12] as applied to ....

....characteristics among its logical subcomponents, rearchitecting along the MTTR/MTTF separation lines may often turn out to be the optimal engineering choice. Balancing MTTR/MTTF characteristics in every component is a step toward building a more robust and highly available system. As explained in [4], RR attempts to exploit strong existing fault isolation boundaries, such as virtual memory, physical node separation, or kernel process control, leading to higher confidence that a sequence of restarts will effectively cure transients. To preserve this property, restart group boundaries should not ....

[Article contains additional citation context not shown here]

G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems, pages 110--115, Elmau, Germany, May 2001.


Designing for High Availability and Measurability - Candea, Fox (2001)   Self-citation (Candea Fox)   (Correct)

....that reacts rapidly to subsystem unavailability, to decrease MTTR. Proactive rejuvenation of software components, to avert failures related to poor resource management, and thus increase MTBF. In this section we will summarize the recursive restartability concept and direct the reader to [2] for details. To define a recursively restartable (RR) system, we take both a functional and a constructional approach. From a functional point of view, a RR system is one in which collections of components subsystems can be graciously restarted with little or no advance warning. One possible way ....

G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Workshop on Hot Topics in Operating Systems, Elmau, Germany, 2001.


Embracing Failure: - Case For Repair-Centric (2001)   (Correct)

No context found.

G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. Submission to the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII).


Recovering Device Drivers - Michael Swift Muthukaruppan (2004)   (6 citations)  (Correct)

No context found.

G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the Eighth IEEE Workshop on Hot Topics in Operating Systems, May 2001.


Improving the Reliability of Commodity Operating Systems - Michael Swift   (Correct)

No context found.

Candea, G. and Fox, A. 2001. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the Eighth IEEE HOTOS. 125--132.


A Survey of Fault-Tolerance and Fault-Recovery Techniques in.. - Treaster (2005)   (2 citations)  (Correct)

No context found.

G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In 8th Workshop on Hot Topics in Operating Systems, pages 125--132, 2001.


Legba: Fast Hardware Support for Fine-Grained Protection - Wiggins, Winwood, Tuch.. (2003)   (1 citation)  (Correct)

No context found.

George Candea and Armando Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proc. 8th HotOS, pages 125--130, 2001.


User-level Device Drivers: Achieved Performance - Leslie, Chubb, Fitzroy-Dale, .. (2005)   (1 citation)  (Correct)

No context found.

George Candea and Armando Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In 8th HotOS, pages 125--130, 2001.


Enhancing Server Availability and Security through .. - Rinard, Cadar.. (2004)   (3 citations)  (Correct)

No context found.

G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), pages 110--115, Schloss Elmau, Germany, May 2001.


Data Structure Repair Using Goal-Directed Reasoning - Brian Demsky Massachusetts   (Correct)

No context found.

G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), Schloss Elmau, Germany, May 2001.


Using Execution Transactions To Recover From Buffer.. - Stelios Sidiroglou.. (2004)   (1 citation)  (Correct)

No context found.

G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), pages 110--115, Schloss Elmau, Germany, May 2001. IEEE Computer Society.


Solar: Building a Context Fusion Network for Pervasive Computing - Chen (2004)   (Correct)

No context found.

George Candea and Armando Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems, pages 125--130, Elmau, Germany, May 2001. IEEE Computer Society Press.


A Review of Software Upgrade Techniques for Distributed Systems - Ajmani (2004)   (1 citation)  (Correct)

No context found.

George Candea and Armando Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In HotOS-VIII, 2001.


Automatic Detection and Repair of Errors in Data Structures - Brian Demsky   (Correct)

No context found.

G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), pages 110--115, Schloss Elmau, Germany, May 2001.


Improving the Reliability of Commodity Operating Systems - Swift, Bershad, Levy (2003)   (12 citations)  (Correct)

No context found.

G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the Eighth IEEE HOTOS, pages 125--132, May 2001.


Nonintrusive Remote Healing Using Backdoors - Florin Sultan Aniruddha (2003)   (Correct)

No context found.

G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. In Proc. HotOS-VIII, May 2001.


Dependency Management in Distributed Settings - Chen, Kotz (2004)   (Correct)

No context found.

G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems, pages 125--130, Elmau, Germany, May 2001.


Static Specification Analysis for Termination of Specification-Based Data..   (Correct)

No context found.

G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), pages 110--115, Schloss Elmau, Germany, May 2001.


Acceptability-Oriented Computing - Martin Rinard Mit (2003)   (Correct)

No context found.

George Candea and Armando Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), pages 110--115, Schloss Elmau, Germany, May 2001.


A Survey on the Interaction between Caching, Translation and.. - Wiggins (2003)   (Correct)

No context found.

George Candea and Armando Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS), pages 125-- 130, 2001.


Automatic Data Structure Repair for Self-Healing Systems - Demsky, Rinard (2003)   (2 citations)  (Correct)

No context found.

G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), pages 110-115, Schloss Elmau, Germany, May 2001.


Phoenix Application Recovery Project - Roger Barga Database   (Correct)

No context found.

G. Candea and A. Fox, Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. HotOS-VIII, 2001.


Impact of Space-Time Multiplexing Granularity on.. - Chandra, Goyal, Shenoy   (Correct)

No context found.

G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), May 2001.


Duplex: A Reusable Fault Tolerance Extension.. - Sharma, Chen, Li.. (2003)   (3 citations)  (Correct)

No context found.

G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), pages 125--132, May 2001.


ROC-1: Hardware Support for Recovery-Oriented Computing - Oppenheimer, Brown.. (2002)   (1 citation)  (Correct)

No context found.

G. Candea and A. Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), 2001.


CiteSeer.IST at NUS - Copyright Penn State and NEC. Hosted by the School of Computing, National University of Singapore.