T. Heath, R. P. Martin, and T. D. Nguyen. Improving cluster availability using workstation validation. In Proc. of ACM SIGMETRICS, 2002.

This paper is cited in the following contexts:
State Maintenance and its Impact on the.. - Gama, Nagaraja..   Self-citation (Martin Nguyen)

No context found.

Taliver Heath, Richard Martin, and Thu D. Nguyen. Improving Cluster Availability Using Workstation Validation. In Proceedings of the ACM SIGMETRICS 2002.


Evaluating the Impact of Communication.. - Nagaraja.. (2003)   Self-citation (Martin Nguyen)

....the performability of the different versions of PRESS. Recall that to make the modeling tractable, we assume that faults in different components are not correlated and all fault arrivals are exponentially distributed. We have done our best to derive meaningful parameters from the available data [11, 12, 15, 21, 35, 34, 36]. However, data is sparse, particularly for application level errors. Thus, we examine performability for a range, once per day to once per month, of MTTFs for application level faults. In addition, because we have multiple classes of errors, we divided the application fault rate between these ....

T. Heath, R. Martin, and T. D. Nguyen. Improving Cluster Availability Using Workstation Validation. In Proceedings of the ACM SIGMETRICS 2002.
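
The excerpt above assumes exponentially distributed fault arrivals and examines application-level MTTFs ranging from once per day to once per month. As a rough illustration only (this is not code from the cited paper, and the function name `prob_fault_within` and the sample MTTF values are assumptions), the sketch below shows what that assumption implies: with exponential arrivals, the probability of seeing at least one fault within a window of length t is 1 - exp(-t / MTTF).

```python
import math

HOURS_PER_DAY = 24.0


def prob_fault_within(window_hours: float, mttf_hours: float) -> float:
    """Probability that at least one fault arrives within the window,
    assuming exponentially distributed arrivals with the given mean (MTTF)."""
    return 1.0 - math.exp(-window_hours / mttf_hours)


# Sample MTTFs spanning the range named in the excerpt (values are illustrative).
for mttf_days in (1, 7, 30):
    p = prob_fault_within(HOURS_PER_DAY, mttf_days * HOURS_PER_DAY)
    print(f"MTTF {mttf_days:>2} days: P(fault within one day) = {p:.3f}")
```

Under this assumption, a window equal to the MTTF still leaves roughly a 37% chance of seeing no fault at all, which is one reason the choice of MTTF range matters for the modeled results.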


Using Fault Injection and Modeling to Evaluate the .. - Nagaraja, Li.. (2003)   (2 citations)  Self-citation (Martin Nguyen)

....rutgers.edu Research mendosus . Table 5 provides a flavor of this data, listing the throughput and duration of each phase of our 7 stage model for VIA PRESS for two types of faults. The MTTFs and MTTRs shown in Table 4 were chosen based on previously reported faults and fault rates [13, 16, 32]. Note that we do not model all the faults that we can inject because there are no reliable statistics for some of them, e.g. application hangs. Finally, our environmental assumptions are that operator response time for stage E is 5 minutes and cluster reset time for stage F is 5 minutes. Recall ....

....these works do not provide a good understanding of how one would estimate overall system availability under a given fault load. There has also been a large number of system availability studies. Two approaches that are used most often include empirical measurements of actual fault rates [3, 13, 20, 16, 23] and a rich set of stochastic process models that describe system dependencies, fault likelihoods over time, and performance [10, 21, 30]. Compared to these complex stochastic models, our models are much simpler, and thus more accessible to practitioners. This stems from our more limited goal of ....

T. Heath, R. Martin, and T. D. Nguyen. Improving Cluster Availability Using Workstation Validation. In Proceedings of the ACM SIGMETRICS 2002.


Quantifying and Improving the Availability of.. - Nagaraja.. (2003)   Self-citation (Martin Nguyen)

....Application hang    2 months   3 minutes
Front end failure       6 months   3 minutes
Table 1: Failures and their MTTFs and MTTRs. Application hang and crash together represent an MTTF of 1 month for application failures.
....rates from previous works which empirically observed the fault rates of many systems [2, 16, 23, 18, 24]. We use Mendosus [21] to inject the expected fault load. Mendosus's network emulation system allows us to differentiate between intra-cluster communication and client-server communication when injecting network-related faults. Thus, the clients are never disturbed by faults injected into the ....

....of this body of work is beyond the scope of this paper. Instead, we concentrate on efforts that have focused on improving the availability of cluster based services. Of course, work analyzing how faults impact systems [14, 19, 31, 32] as well as empirical measurement of actual fault rates [2, 16, 23, 18, 24], are necessary background for a model based quantification effort such as ours. Our methodology and infrastructure seem to be the first directed to quantifying the availability impact of a range of techniques as applied to cluster based services. One of the first works on the subject [13] argued ....

T. Heath, R. Martin, and T. D. Nguyen. Improving Cluster Availability Using Workstation Validation. In Proceedings of the ACM SIGMETRICS 2002.
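
The Table 1 fragment quoted in this entry gives an application hang MTTF of 2 months with a 3 minute MTTR, and says hang and crash together represent an MTTF of 1 month. A minimal sketch of the arithmetic behind such a statement follows, assuming independent, exponentially distributed faults (so that rates add) and a 30-day month. Both are assumptions made here, not statements from the excerpt. The helper names are hypothetical, and the `unavailability` function is the textbook steady-state formula MTTR / (MTTF + MTTR), not the request-based metric the citing papers define.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


def combined_mttf(*mttfs: float) -> float:
    """MTTF of several independent exponential fault sources (rates add)."""
    return 1.0 / sum(1.0 / m for m in mttfs)


def implied_mttf(combined: float, known: float) -> float:
    """MTTF of the remaining source, given the combined MTTF and one known source."""
    return 1.0 / (1.0 / combined - 1.0 / known)


def unavailability(mttf: float, mttr: float) -> float:
    """Textbook steady-state fraction of time spent down for one fault type."""
    return mttr / (mttf + mttr)


hang_mttf = 2 * MINUTES_PER_MONTH   # from the Table 1 fragment
app_mttf = 1 * MINUTES_PER_MONTH    # hang and crash together, per the caption
crash_mttf = implied_mttf(app_mttf, hang_mttf)

print(f"implied crash MTTF: {crash_mttf / MINUTES_PER_MONTH:.2f} months")
print(f"hang unavailability: {unavailability(hang_mttf, 3.0):.2e}")
```

Under these assumptions the crash MTTF implied by the caption is also 2 months, since two sources at one fault per 2 months each combine to one fault per month.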


Evaluating the Impact of Communication.. - Nagaraja.. (2002)   Self-citation (Martin Nguyen)

....the performability of the different versions of PRESS. Recall that to make the modeling tractable, we assume that faults in different components are not correlated and all fault arrivals are exponentially distributed. We have done our best to derive meaningful parameters from the available data [11, 12, 15, 21, 35, 34, 36]. However, data is sparse, particularly for application level errors. Thus, we examine performability for a range, once per day to once per month, of MTTFs for application level faults. In addition, because we have multiple classes of errors, we divided the application fault rate between these ....

T. Heath, R. Martin, and T. D. Nguyen. Improving Cluster Availability Using Workstation Validation. In Proceedings of the ACM SIGMETRICS 2002.


Using Fault Model Enforcement to Improve Availability - Nagaraja, Bianchini.. (2002)   (4 citations)  Self-citation (Martin Nguyen)

....one must further abstract away from reality. To reason about a real system, we usually model all components as fail-stop. We also hope that failure rates and recovery occur with exponential distributions, even though there is strong empirical evidence against this, at least for workstations [17], [18].
III. FAULT MODEL ENFORCEMENT
Given the previous description of complex computer systems, creating reasonably accurate abstractions of them seems to be an impossibly complex task. There are too many different subsystems, none of which any one person fully understands, connected by ....

....versions of PRESS. Recall that to make the modeling tractable, we assume that faults in different components are not correlated and all fault arrivals are exponentially distributed. We have done our best to derive meaningful parameters from the available data [25], [26], [27], [18], [12], [28], [17]. A duration of 5 minutes was assumed for the operator intervention stage E and restart stage F.
B. Evaluation Metrics
Our model computes two metrics to evaluate each server. The first is the unavailability, which is the average fraction of requests dropped. We use unavailability instead of ....

Taliver Heath, Richard Martin, and Thu D. Nguyen, "Improving Cluster Availability Using Workstation Validation," to appear in Proceedings of the ACM SIGMETRICS 2002.


Evaluating the Impact of Communication.. - Nagaraja.. (2003)   Self-citation (Martin Nguyen)

....the performability of the different versions of PRESS. Recall that to make the modeling tractable, we assume that faults in different components are not correlated and all fault arrivals are exponentially distributed. We have done our best to derive meaningful parameters from the available data [41, 13, 16, 27, 42, 43, 20]. However, data is sparse, particularly for application level errors. Thus, we examine performability for a range, once per day to once per month, of MTTFs for application level faults. In addition, because we have multiple classes of errors, we divided the fault rate between these errors ....

T. Heath, R. Martin, and T. D. Nguyen. Improving Cluster Availability Using Workstation Validation. To appear in Proceedings of the ACM SIGMETRICS 2002.


Using Fault Injection to Evaluate the.. - Nagaraja, Li.. (2003)   (3 citations)  Self-citation (Martin Nguyen)

....rutgers.edu Research mendosus . Table 5 provides a flavor of this data, listing the throughput and duration of each phase of our 7 phase model for VIA PRESS for two types of faults. The MTTFs and MTTRs shown in Table 6 were chosen based on previously reported faults and fault rates [14, 16, 28]. Note that we do not model all the faults that we can inject because there are no reli....
Phase   Switch Failure                            Application Crash
        Throughput (reqs/sec)   Duration (secs)   Throughput (reqs/sec)   Duration (secs)
A       892.40                  75                1889.10                 10
B       0                                         3143.55                 145
C       1106.70                 3525              4537.60                 25
D       0                                         4789.13                 45
E ....

T. Heath, R. Martin, and T. D. Nguyen. Improving Cluster Availability Using Workstation Validation. To appear in Proceedings of the ACM SIGMETRICS 2002.


A Large-Scale Study of Failures in High-Performance.. - Bianca Schroeder Garth

No context found.

T. Heath, R. P. Martin, and T. D. Nguyen. Improving cluster availability using workstation validation. In Proc. of ACM SIGMETRICS, 2002.


CiteSeer.IST at NUS - Copyright Penn State and NEC. Hosted by the School of Computing, National University of Singapore.