International Journal of Advanced Computer Science and Applications (IJACSA), Volume 9 Issue 12, 2018.
Abstract: Reliability is the biggest concern facing future extreme-scale, high performance computing (HPC) systems. Projections based on the current generation of HPC systems suggest that errors will occur at very high rates in future systems. It is therefore essential to detect errors that can cause important applications, such as scientific ones, to fail. In this paper, we present a two-level fault-tolerance approach for detecting and classifying errors on Compute Unified Device Architecture (CUDA)-based Graphics Processing Units (GPUs). The first level detects the existence of errors by using software redundancy that applies design diversity. The second level investigates the problematic software version and re-executes it on a different hardware component to classify the error as either a permanent hardware error or a software error. We implemented our approach on GPUs and conducted proof-of-concept experiments by running three versions of matrix multiplication under different error scenarios. The results show the feasibility of the proposed approach.
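The two levels described in the abstract can be illustrated with a small CPU-only sketch. This is not the authors' CUDA implementation: the function names (`detect`, `classify`), the majority vote used to single out the problematic version, and the modelling of a hardware component as a callable "unit" are all assumptions made for illustration. Integer matrices are used so that results can be compared exactly.

```python
from collections import Counter

# Three diverse versions of matrix multiplication (design diversity).
def matmul_ijk(a, b):
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c

def matmul_ikj(a, b):
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            for j in range(p):
                c[i][j] += a[i][k] * b[k][j]
    return c

def matmul_columns(a, b):
    cols = list(zip(*b))  # columns of b
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

# Hypothetical version with an injected software bug.
def matmul_buggy(a, b):
    c = matmul_ijk(a, b)
    c[0][0] += 1
    return c

# Hypothetical hardware components, modelled as callables that run a version.
def healthy_unit(fn, a, b):
    return fn(a, b)

def faulty_unit(fn, a, b):
    c = fn(a, b)
    c[-1][-1] ^= 1  # simulated permanent fault: always flips the lowest bit
    return c

def _freeze(matrix):
    # Make the result hashable so it can be counted in a vote.
    return tuple(tuple(row) for row in matrix)

def detect(assignments, a, b):
    """Level 1: run each version on its assigned unit and compare results.

    `assignments` maps a version name to a (function, unit) pair.
    Returns the majority result and the names of versions that disagree.
    """
    results = {name: _freeze(unit(fn, a, b))
               for name, (fn, unit) in assignments.items()}
    majority, votes = Counter(results.values()).most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError("no majority result; cannot identify a suspect")
    suspects = [name for name, r in results.items() if r != majority]
    return majority, suspects

def classify(fn, a, b, reference, spare_unit):
    """Level 2: re-execute the suspect version on a different unit.

    If the mismatch disappears, the original unit is blamed (hardware error);
    if it persists, the version itself is blamed (software error).
    """
    rerun = _freeze(spare_unit(fn, a, b))
    return "permanent hardware error" if rerun == reference else "software error"
```

A caller would pass three `(version, unit)` pairs to `detect`, then call `classify` for each returned suspect with a spare unit. One limitation of this sketch is that a single re-execution cannot distinguish a transient fault from a permanent one; how the paper handles that case is not stated in the abstract.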
Aishah M. Aseeri and Mai A. Fadel, “A Two-Level Fault-Tolerance Technique for High Performance Computing Applications,” International Journal of Advanced Computer Science and Applications (IJACSA), 9(12), 2018. http://dx.doi.org/10.14569/IJACSA.2018.091207
@article{Aseeri2018,
title = {A Two-Level Fault-Tolerance Technique for High Performance Computing Applications},
journal = {International Journal of Advanced Computer Science and Applications},
doi = {10.14569/IJACSA.2018.091207},
url = {http://dx.doi.org/10.14569/IJACSA.2018.091207},
year = {2018},
publisher = {The Science and Information Organization},
volume = {9},
number = {12},
author = {Aishah M. Aseeri and Mai A. Fadel}
}
Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, even commercially, as long as the original work is properly cited.