Grid and Cluster Computing

Nian-Feng Tzeng
Center for Advanced Computer Studies
University of Louisiana at Lafayette

Grid Computing

With large-scale resource sharing, grid computing has changed the way we handle computation and data storage, potentially enabling us to better tackle computationally intensive tasks and storage-demanding applications. A computational grid often consists of diverse resources aggregated together, and it naturally faces various technical challenges, including resource tracking and management, job assignment and task migration, security, and user interfaces and tools. We have dealt with real-time job scheduling and effective resource tracking and management for grid computing. Checkpointing and task migration in computational grids are being investigated.

Cluster Computing

Networks of workstations (NOWs) and massively parallel processor (MPP) systems are popular platforms for cluster computing. They mostly belong to the class of distributed memory machines. Distributed shared memory (DSM) systems build a shared-memory abstraction on top of distributed memory machines, so that users see a shared-memory environment while message passing is taken care of by the DSM layer. Performance and reliability improvements of software DSM on NOWs and MPP systems for cluster computing have been investigated.
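
The shared-memory abstraction described above can be illustrated with a minimal single-process sketch. It is not the TreadMarks design or any system from our studies: the class names (Node, ToyDSM), the fixed home node per page, and the cost of two messages per remote access are all assumptions made for illustration. The point is only that callers use plain read and write operations on shared addresses, while the layer underneath decides when messages would be exchanged.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4  # words per page; tiny on purpose (assumed value)


@dataclass
class Node:
    node_id: int
    # Local memory of this node: page number -> list of words.
    pages: dict = field(default_factory=dict)


class ToyDSM:
    """Hypothetical toy DSM layer: every page lives on one fixed home node."""

    def __init__(self, num_nodes, num_pages):
        self.nodes = [Node(i) for i in range(num_nodes)]
        self.messages = 0  # messages a real system would have sent
        for page in range(num_pages):
            self._home(page).pages[page] = [0] * PAGE_SIZE

    def _home(self, page):
        # Assumed placement policy: pages are spread round-robin over nodes.
        return self.nodes[page % len(self.nodes)]

    def read(self, node_id, addr):
        page, offset = divmod(addr, PAGE_SIZE)
        home = self._home(page)
        if home.node_id != node_id:
            self.messages += 2  # assumed cost: request + reply
        return home.pages[page][offset]

    def write(self, node_id, addr, value):
        page, offset = divmod(addr, PAGE_SIZE)
        home = self._home(page)
        if home.node_id != node_id:
            self.messages += 2  # assumed cost: update + acknowledgement
        home.pages[page][offset] = value


dsm = ToyDSM(num_nodes=2, num_pages=4)
dsm.write(0, 5, 42)      # address 5 is on page 1, whose home is node 1
local = dsm.read(1, 5)   # node 1 reads its own page: no messages
remote = dsm.read(0, 5)  # node 0 reads remotely: two more messages
```

A real software DSM caches pages locally and runs a coherence protocol to keep the copies consistent, which is where most of the message overhead discussed in the next section arises. This sketch omits caching entirely.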

Software DSM Performance Improvement
Software DSM performance is particularly sensitive to the message communication overhead incurred in enforcing memory coherence. An aggressive release consistency method was considered to reduce this overhead. The method was implemented on the TreadMarks framework and evaluated in detail under different applications, where it was found to achieve considerable overhead reduction. Synchronization mechanisms protect shared variables from data races, in which multiple unsynchronized accesses lead to unpredictable results. Since DSM performance depends on the number of messages exchanged at run time, and synchronization accounts for most of those messages, efficient synchronization mechanisms further improve overall performance, as revealed and detailed in our studies.
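
To make the link between synchronization and message count concrete, the following sketch uses Python threads as stand-ins for DSM nodes. It is not the aggressive release consistency method or any technique from our studies. The CountingLock class is hypothetical, and the cost of three messages per critical section (request, grant, release) is an assumed model of a centralized lock manager. The lock keeps the shared counter free of data races, and entering fewer, larger critical sections lowers the assumed message total for the same amount of work.

```python
import threading


class CountingLock:
    """Hypothetical lock that tallies messages under an assumed cost model."""

    MESSAGES_PER_ACQUIRE = 2  # assumed: request + grant
    MESSAGES_PER_RELEASE = 1  # assumed: release notice

    def __init__(self):
        self._lock = threading.Lock()
        self.messages = 0

    def __enter__(self):
        self._lock.acquire()
        self.messages += self.MESSAGES_PER_ACQUIRE  # safe: lock is held
        return self

    def __exit__(self, exc_type, exc, tb):
        self.messages += self.MESSAGES_PER_RELEASE  # still holding the lock
        self._lock.release()
        return False


def run(num_workers, increments, batch):
    """Each worker adds `increments` to a shared counter, `batch` per lock hold.

    `increments` is assumed to be a multiple of `batch`.
    """
    lock = CountingLock()
    shared = {"counter": 0}

    def worker():
        for _start in range(0, increments, batch):
            with lock:  # critical section protects the shared variable
                for _i in range(batch):
                    shared["counter"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["counter"], lock.messages


fine_total, fine_messages = run(num_workers=4, increments=100, batch=1)
coarse_total, coarse_messages = run(num_workers=4, increments=100, batch=10)
```

Both runs produce the same counter value, but the coarse-grained run enters one tenth as many critical sections and therefore incurs one tenth of the assumed messages. Coarser locking also reduces concurrency, so this is a trade-off rather than a free gain.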
To observe the actual behavior of our proposed synchronization techniques, we used NOWs and an MPP system as the platforms for our experiments. The NOWs employed in the study are housed in the CAN Lab. The MPP system used in our work is the IBM SP machine provided by the Mathematics and Computer Science Division at Argonne National Laboratory.
Recoverable Software DSM
While software DSM continues to improve its performance and scalability, the probability of system failures increases as its size grows. It is highly desirable to develop effective mechanisms for supporting fast crash recovery from node failures, giving rise to recoverable software DSM. We have proposed efficient logging and recovery protocols for software DSM, such that they incur negligible overhead during fault-free execution while shortening the recovery time significantly after failures arise.
A new and efficient coordinated checkpointing technique was introduced for software DSM, in order to minimize both the checkpointing overhead during failure-free execution and the cost of recovery from failures. This is realized by leveraging existing coherence information maintained by software DSM, so as to allow failure recovery from the most recent checkpoint, thereby saving re-computation time. Performance experiments on NOWs, with the introduced technique incorporated into the TreadMarks framework, have demonstrated that it consistently outperforms other checkpointing and fault recovery approaches.
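
The general idea of coordinated checkpointing, with recovery from the most recent checkpoint, can be sketched as follows. This is a generic illustration and not the coherence-based technique described above: it does not use any coherence information, and the Worker class, the run function, and the fixed checkpoint interval are all hypothetical. All workers save their state at the same step, so a failure rolls every worker back to one consistent point and only the steps since that point are recomputed.

```python
import copy


class Worker:
    """Hypothetical worker whose state advances deterministically per step."""

    def __init__(self, wid):
        self.wid = wid
        self.state = {"step": 0, "value": 0}
        self.checkpoint = copy.deepcopy(self.state)  # initial checkpoint

    def compute(self):
        self.state["step"] += 1
        self.state["value"] += self.wid + 1

    def save(self):
        self.checkpoint = copy.deepcopy(self.state)

    def restore(self):
        self.state = copy.deepcopy(self.checkpoint)


def run(num_workers, total_steps, checkpoint_every, fail_at_step=None):
    """Run all workers in lockstep; optionally inject one failure."""
    workers = [Worker(i) for i in range(num_workers)]
    failed_once = False
    recomputed = 0  # steps that had to be executed a second time
    step = 0
    while step < total_steps:
        for w in workers:
            w.compute()
        step += 1
        if fail_at_step == step and not failed_once:
            failed_once = True
            # Coordinated recovery: every worker rolls back together.
            for w in workers:
                w.restore()
            rolled_back_to = workers[0].state["step"]
            recomputed += step - rolled_back_to
            step = rolled_back_to
            continue
        if step % checkpoint_every == 0:
            # Coordinated checkpoint: all workers save at the same step.
            for w in workers:
                w.save()
    return [w.state["value"] for w in workers], recomputed


clean_values, clean_recomputed = run(2, 10, 4)
failed_values, failed_recomputed = run(2, 10, 4, fail_at_step=7)
```

In this run the failure at step 7 rolls both workers back to the checkpoint taken at step 4, so three steps are recomputed and the final values match the failure-free run. Checkpointing more often shortens the rollback but adds overhead during failure-free execution, which is the trade-off the technique above aims to minimize.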

Publications

A. Kongmunvattana, S. Tanchatchawal, and N.-F. Tzeng, Coherence-Based Coordinated Checkpointing for Software Distributed Shared Memory Systems, Proc. 20th Int'l Conference on Distributed Computing Systems (ICDCS 2000), Apr. 2000, pp. 556-563. Abstract, PDF (87K).
A. Kongmunvattana and N.-F. Tzeng, Logging and Recovery in Adaptive Software Distributed Shared Memory Systems, Proc. 18th IEEE Symposium on Reliable Distributed Systems (SRDS), Oct. 1999, pp. 202-211. Abstract, PDF (113K).
A. Kongmunvattana and N.-F. Tzeng, Coherence-Centric Logging and Recovery for Home-Based Software Distributed Shared Memory, Proc. 1999 Int'l Conference on Parallel Processing (ICPP '99), Sept. 1999, pp. 274-281. Abstract, PDF (96K).
A. Kongmunvattana and N.-F. Tzeng, Lazy Logging and Prefetch-Based Crash Recovery in Software Distributed Shared Memory Systems, Proc. 13th Int'l Parallel Processing Symposium (IPPS '99), Apr. 1999, pp. 399-406. Abstract, PDF (123K).
N.-F. Tzeng and A. Kongmunvattana, Distributed Shared Memory Systems with Improved Barrier Synchronization and Data Transfer, Proc. 11th ACM Int'l Conference on Supercomputing (ICS), July 1997, pp. 148-155. Abstract, Compressed PDF (1.1M).
S. S. Fu and N.-F. Tzeng, Aggressive Release Consistency for Software Distributed Shared Memory, Proc. 17th Int'l Conference on Distributed Computing Systems, May 1997, pp. 288-295. Abstract, Compressed Postscript (101K).
S. S. Fu, N.-F. Tzeng, and Z. Li, Empirical Evaluation of Distributed Mutual Exclusion Algorithms, Proc. 11th Int'l Parallel Processing Symposium, April 1997, pp. 255-259. Abstract, Compressed Postscript (60K).
S. S. Fu and N.-F. Tzeng, Lock Improvement Technique for Release Consistency in Distributed Shared Memory Systems, Proc. 6th Symposium on the Frontiers of Massively Parallel Computation, October 1996, pp. 255-262. Abstract, Compressed Postscript (55K).

Funding

National Science Foundation under Grants MIP-9201308, CCR-9803505, and EIA-9871315.
Board of Regents, State of Louisiana under Contract No. LEQSF(1998-99)-ENH-TR-101.

Send e-mail to: tzeng@cacs.louisiana.edu