Abstract
Computer applications now span the globe and incorporate devices ranging in size and power from watches to clustered supercomputers. The further a system reaches and the more its heterogeneity increases, the more fragile (susceptible to exceptions and errors) it becomes. Every system we design and build is more likely than ever to encounter, and to have to recover from, unreliable communication. It is time for rollback-recovery techniques to become mainstream software design topics. This paper surveys the daunting volume of research literature that explores such techniques, concentrating on those approaches that can be implemented in any application environment (for example, those with no language dependencies). It splits these techniques into checkpoint-based and log-based techniques, and then subdivides each of those families. While this taxonomy alone is helpful, the authors go even deeper and analyze the key ideas underlying each technique, along with the problems that accompany their implementation.