Collaborative Research: Comprehensive Algorithmic Resilience (CAR) for Big Data Analytics


Overview

With relentless manufacturing technology scaling coupled with increasing power densities, modern processors are prone to hardware and software failures, which may manifest as errors during execution. Processor errors can be broadly classified into two categories: soft errors and hard errors.

Big Data analytics is the process of mining useful knowledge from very large data sets, and many Big Data applications are built on algorithms whose desirable properties can be leveraged for concurrent error detection, achieving algorithmic resilience during application execution. For example, the off-line analytics common to various Big Data applications, such as search engines (e.g., PageRank), social networks (e.g., K-Means), and e-commerce (e.g., Naïve Bayes) included in the recently introduced BigDataBench, exhibit certain invariants over the course of their long-running executions, and such invariants serve as the basis of our algorithmic resilience foundation. We hence seek comprehensive algorithmic resilience in support of Big Data analytics executed on computer systems in which errors are expected to occur during the analytics.
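
As a concrete illustration of such an invariant (our sketch, not code from BigDataBench), consider PageRank: a normalized rank vector always sums to one after each power iteration, so a deviation beyond floating-point tolerance flags a likely error. The function names and tolerance below are illustrative assumptions.

    import numpy as np

    def pagerank_step(ranks, links, d=0.85):
        # One power-iteration step over a column-stochastic link matrix.
        n = len(ranks)
        return (1.0 - d) / n + d * (links @ ranks)

    def rank_sum_holds(ranks, tol=1e-9):
        # Invariant: a normalized PageRank vector sums to 1; a deviation
        # beyond floating-point tolerance signals a likely soft error.
        return abs(ranks.sum() - 1.0) < tol

    n = 4
    links = np.full((n, n), 1.0 / n)   # toy column-stochastic web graph
    ranks = np.full(n, 1.0 / n)
    for _ in range(50):
        ranks = pagerank_step(ranks, links)
        assert rank_sum_holds(ranks), "invariant violated: suspect a soft error"

Such a checker runs in time linear in the number of pages, a negligible cost next to the iteration itself, which is what makes invariant-based concurrent error detection attractive for long-running analytics.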

This research embraces an array of comprehensive algorithmic resilience (CAR) software techniques, including algorithmic error detection, coordinated checkpointing, and execution recovery from detected errors, to achieve high resilience for networked computer systems. Building on our prior work on adaptive incremental checkpointing (AIC), which predicts checkpoint overhead as execution progresses to determine desirable points in time for taking checkpoints adaptively, this research aims to (1) tailor AIC for long-running Big Data analytics carried out on distributed-memory (DM) platforms (commonly the case for analytics) under coordinated checkpointing, and (2) investigate and evaluate execution recovery from the checkpointed files to a globally consistent state across all participating nodes after execution faults are detected.
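
The decision logic at the heart of AIC can be conveyed with a minimal sketch, assuming a simple linear cost model for writing dirty pages and a small-probability failure approximation; this is our illustration, not the project's AIC implementation, and every parameter name and threshold here is an assumption.

    def should_checkpoint(dirty_pages, secs_per_page, elapsed_s, failure_rate):
        # dirty_pages   : memory pages modified since the last checkpoint
        # secs_per_page : predicted time to write one dirty page
        # elapsed_s     : work since the last checkpoint (lost on failure)
        # failure_rate  : assumed failures per second (1 / MTBF)
        predicted_overhead = dirty_pages * secs_per_page
        # Expected rework ~ rate * t^2 / 2 for small failure probabilities:
        # chance of failing within t is ~rate*t, and ~t/2 of work is lost.
        expected_rework = failure_rate * elapsed_s * elapsed_s / 2.0
        return expected_rework > predicted_overhead

    # e.g., 10,000 dirty pages at 50 us each vs. 20 min of unprotected work
    print(should_checkpoint(10_000, 50e-6, 20 * 60, 1.0 / (24 * 3600)))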

We have completed an investigation into a restore-express (REX) strategy for multi-level checkpointing (MLC), applicable to any form of incremental checkpointing (IC). With its primary goal of accelerating execution recovery from permanent failures, REX is valuable to computer systems running long-execution jobs, since failure recovery lies on the execution critical path. REX is considered part of the CAR software functionality.
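
The text above does not spell out REX's internals, but the general MLC restore pattern it accelerates can be sketched as follows: probe checkpoint levels from fastest to slowest and restore from the first one that still holds the data. The three-level hierarchy, paths, and function names below are assumptions for illustration only.

    from pathlib import Path

    # Assumed hierarchy, fastest first: node-local storage, a partner
    # node's copy, then the parallel file system (slowest, most durable).
    CHECKPOINT_LEVELS = [Path("/local/ckpt"),
                         Path("/partner/ckpt"),
                         Path("/pfs/ckpt")]

    def restore_express(job_id):
        # Restore from the fastest level still holding the checkpoint,
        # since failure recovery sits on the execution critical path.
        for level in CHECKPOINT_LEVELS:
            candidate = level / f"{job_id}.ckpt"
            if candidate.exists():          # integrity checks elided here
                return candidate.read_bytes()
        raise RuntimeError(f"no checkpoint found for job {job_id}")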

In addition, we have addressed effective checkpointing control in real-world computing systems, whose mean time between failures (MTBF) may be unknown a priori or may fluctuate widely over the course of job execution. The checkpointing control under our development is oblivious to MTBF, in contrast to earlier pursuits, which assume a system's MTBF stays constant (designated MTBF_ideal) throughout job execution and take checkpoints at a fixed inter-checkpoint interval chosen to minimize the mean total execution time under MTBF_ideal.
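
For background on the fixed-interval approach being contrasted (the paragraph above does not name it, but the standard choice in the literature is Young's first-order approximation), the interval minimizing mean total execution time under a known, constant MTBF is tau = sqrt(2 * C * M), where C is the checkpoint cost and M is MTBF_ideal; the sketch below computes it. Treat this as background rather than the project's method.

    import math

    def young_interval(ckpt_cost_s, mtbf_s):
        # Young's first-order optimum: tau = sqrt(2 * C * M). Valid only
        # when MTBF is known and constant, which is precisely the
        # assumption that MTBF-oblivious checkpointing control avoids.
        return math.sqrt(2.0 * ckpt_cost_s * mtbf_s)

    # e.g., a 60 s checkpoint cost and a 24 h MTBF give roughly 54 minutes
    print(young_interval(60.0, 24 * 3600) / 60.0)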

In summary, the two collaborative teams aim to jointly accomplish the following research goals in this project: (1) implementing algorithmic checkers for the BigDataBench codes to lift their resilience via concurrent error detection; (2) evaluating the resilient BigDataBench codes with AIC and REX incorporated in support of execution recovery from detected failures; (3) devising and assessing coordinated checkpointing in networked computer systems in terms of restore times from failures and of checkpointing overhead; (4) proposing and evaluating a bufferless NoC design that lowers traffic deflection for improved throughput and power savings; and (5) designing a log-based method to detect and recover from potential soft errors in multithreaded applications.

