Grid and Cluster Computing
Nian-Feng Tzeng
Center for Advanced Computer Studies
University of Louisiana at Lafayette
Grid Computing
With large-scale resource sharing, grid computing has changed the way we handle computation and data storage, potentially enabling us to better tackle computationally intensive tasks and storage-demanding applications.
A computational grid often aggregates diverse resources, so it faces a range of technical challenges, including resource tracking and management, job assignment and task migration, security, and user interfaces and tools.
We have worked on real-time job scheduling and on effective resource tracking and management for grid computing.
Checkpointing and task migration in computational grids are being investigated.
Cluster Computing
Networks of workstations (NOWs) and massively parallel processor (MPP) systems
are popular platforms for cluster computing. They mostly belong to the class
of distributed memory machines.
Distributed shared memory (DSM) systems build a shared-memory abstraction on top of distributed memory machines, so that users
see a shared-memory environment while message passing is taken care of
by the DSM layer.
We have investigated ways to improve the performance and reliability of
software DSM on NOWs and MPP systems for cluster computing.
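To make the shared-memory abstraction concrete, here is a minimal sketch in which callers use plain read and write calls while a hypothetical layer turns remote accesses into messages. The names `ToyDSM` and `Node`, the modulo-based home assignment, and the in-process message queues are all assumptions made for illustration; they are not taken from TreadMarks or from any system described on this page.

```python
from collections import deque


class Node:
    """One machine with private memory; it reaches peers only via messages."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.local_pages = {}   # page number -> value held in this node's memory
        self.inbox = deque()    # messages received from other nodes


class ToyDSM:
    """Hypothetical DSM layer: exposes read/write and hides the messaging."""

    def __init__(self, num_nodes):
        self.nodes = [Node(i) for i in range(num_nodes)]
        self.messages_sent = 0

    def _home(self, page):
        # Assumption: each page has a fixed home node chosen by modulo.
        return self.nodes[page % len(self.nodes)]

    def _send(self, dst, msg):
        self.messages_sent += 1
        dst.inbox.append(msg)

    def write(self, caller_id, page, value):
        home = self._home(page)
        if home.node_id != caller_id:
            self._send(home, ("write", page, value))
            home.inbox.popleft()   # home node consumes the request
        home.local_pages[page] = value

    def read(self, caller_id, page):
        home = self._home(page)
        if home.node_id != caller_id:
            self._send(home, ("read", page))
            home.inbox.popleft()   # home node consumes the request
            self._send(self.nodes[caller_id], ("reply", page))
            self.nodes[caller_id].inbox.popleft()   # caller consumes the reply
        return home.local_pages.get(page)


dsm = ToyDSM(num_nodes=4)
dsm.write(caller_id=0, page=5, value="x")   # page 5 lives on node 1: one message
value = dsm.read(caller_id=2, page=5)       # remote read: request plus reply
```

The caller never builds a message itself, yet `dsm.messages_sent` grows with every remote access, which is the kind of cost the performance work below aims to reduce.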
- Software DSM Performance Improvement
Software DSM performance is particularly sensitive to message communication
overhead incurred for memory coherence enforcement.
We considered an aggressive release consistency method to reduce this overhead.
A real implementation of the method on the TreadMarks framework, evaluated
in detail under different applications, showed a considerable reduction
in overhead.
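As a rough illustration of why release consistency can lower coherence traffic, the sketch below compares a writer that sends one message per write with a writer that buffers writes and propagates them once at release time. This shows only the general release consistency idea, not the aggressive variant studied here; the class names and the message representation are hypothetical.

```python
class EagerWriter:
    """Sends one coherence message per write to shared data."""

    def __init__(self):
        self.messages = []

    def write(self, addr, value):
        self.messages.append({addr: value})

    def release(self):
        pass  # nothing left to propagate


class ReleaseConsistentWriter:
    """Buffers writes and propagates them in one message at release time."""

    def __init__(self):
        self.messages = []
        self.pending = {}

    def write(self, addr, value):
        self.pending[addr] = value   # a later write to addr replaces the earlier one

    def release(self):
        if self.pending:
            self.messages.append(dict(self.pending))
            self.pending.clear()


def run_critical_section(writer):
    # Ten updates to one variable and one to another, then a release.
    for i in range(10):
        writer.write("counter", i)
    writer.write("flag", True)
    writer.release()
    return len(writer.messages)


eager_count = run_critical_section(EagerWriter())
batched_count = run_critical_section(ReleaseConsistentWriter())
```

Here the eager writer produces eleven messages and the buffering writer produces one, because other nodes only need to see the final values once the release happens.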
Synchronization mechanisms protect shared variables from data races,
in which unsynchronized accesses by multiple processes lead to
unpredictable results.
Because DSM performance depends on the number of messages exchanged
at run time, and synchronization accounts for most of those messages,
efficient synchronization mechanisms further improve
overall performance, as revealed and detailed in our studies.
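The link between synchronization and cost can be sketched with ordinary threads standing in for DSM nodes. In the example below, a lock keeps updates to a shared total free of data races, and the number of lock acquisitions serves as a stand-in for synchronization messages. This is an analogy under stated assumptions: the `shared` dictionary, the two worker functions, and the use of Python threads are illustrative and do not reproduce the techniques evaluated in our studies.

```python
import threading

lock = threading.Lock()
shared = {"total": 0, "acquires": 0}


def add_per_item(values):
    # One acquire/release pair per update.
    for v in values:
        with lock:
            shared["acquires"] += 1
            shared["total"] += v


def add_batched(values):
    # Accumulate privately, then use one acquire/release pair for the batch.
    partial = sum(values)
    with lock:
        shared["acquires"] += 1
        shared["total"] += partial


def run(worker, num_threads=4, items=1000):
    shared["total"] = 0
    shared["acquires"] = 0
    threads = [threading.Thread(target=worker, args=([1] * items,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["total"], shared["acquires"]


per_item_result = run(add_per_item)   # (total, lock acquisitions)
batched_result = run(add_batched)
```

Both runs produce the same total because every update happens under the lock, but the batched worker acquires the lock once per thread instead of once per item. In a software DSM, each avoided acquisition would typically save one or more messages.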
To observe the actual behavior of our proposed
synchronization techniques, we used NOWs and an MPP system as experimental
platforms. The NOWs used in the study are housed in the
CAN Lab.
The MPP system is the IBM SP machine provided by the
Mathematics and Computer Science Division at Argonne National Laboratory.
- Recoverable Software DSM
While software DSM continues to improve in performance and scalability,
the probability of system failure increases as the system grows.
It is highly desirable to develop effective mechanisms for supporting
fast crash recovery from node failures, giving rise to recoverable software DSM.
We have proposed efficient logging and recovery protocols for software DSM,
such that they incur negligible overhead during fault-free execution
while shortening the recovery time significantly after failures arise.
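The general shape of logging-based recovery can be sketched as follows: a node records each update it applies after its last checkpoint, and after a crash it restores the checkpoint and replays the log. This is a generic illustration rather than the protocols proposed here. The `LoggedNode` class is hypothetical, and the sketch assumes that the checkpoint and the log survive the crash (for example, on stable storage).

```python
import copy


class LoggedNode:
    def __init__(self):
        self.state = {}        # volatile memory contents
        self.checkpoint = {}   # assumed to live on stable storage
        self.log = []          # updates applied since the last checkpoint

    def apply(self, addr, value):
        self.log.append((addr, value))
        self.state[addr] = value

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()       # older updates are covered by the checkpoint

    def crash(self):
        self.state = {}        # volatile memory is lost

    def recover(self):
        # Restore the checkpoint, then replay logged updates in order.
        self.state = copy.deepcopy(self.checkpoint)
        for addr, value in self.log:
            self.state[addr] = value


node = LoggedNode()
node.apply("a", 1)
node.take_checkpoint()
node.apply("b", 2)
node.apply("a", 3)
before_crash = dict(node.state)
node.crash()
node.recover()
```

After recovery, `node.state` matches `before_crash`. The trade-off is visible even in this toy: every logged update adds work during fault-free execution, while a longer log means more replay during recovery.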
A new and efficient coordinated checkpointing technique was introduced for
software DSM, in order to minimize both the checkpointing overhead during
failure-free execution, and the cost of recovery from failures.
This is realized by leveraging existing coherence information maintained by
software DSM, so as to allow failure recovery from the most recent checkpoint,
saving the re-computation time.
Performance experiments on NOWs, with the introduced technique incorporated
into the TreadMarks framework, have demonstrated its ability to consistently
outperform other checkpointing and fault recovery approaches.
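For readers unfamiliar with coordinated checkpointing, the sketch below shows a generic two-phase version: every node first saves a tentative checkpoint, and the checkpoint becomes the recovery point only after all nodes have acknowledged. It does not model the coherence-based technique described above, which additionally exploits coherence information kept by the DSM. The `Worker` class and the `coordinated_checkpoint` function are hypothetical names.

```python
import copy


class Worker:
    def __init__(self, name):
        self.name = name
        self.state = {"step": 0}
        self.tentative = None
        self.committed = copy.deepcopy(self.state)

    def compute(self):
        self.state["step"] += 1

    def prepare(self):
        # Phase 1: save a tentative checkpoint and acknowledge.
        self.tentative = copy.deepcopy(self.state)
        return True

    def commit(self):
        # Phase 2: the tentative checkpoint becomes the recovery point.
        self.committed = self.tentative
        self.tentative = None

    def rollback(self):
        self.state = copy.deepcopy(self.committed)


def coordinated_checkpoint(workers):
    # Commit only if every worker acknowledged the prepare phase.
    if all(w.prepare() for w in workers):
        for w in workers:
            w.commit()
        return True
    return False


workers = [Worker(f"n{i}") for i in range(3)]
for w in workers:
    w.compute()
    w.compute()
checkpoint_ok = coordinated_checkpoint(workers)
workers[0].compute()   # progress made after the checkpoint
for w in workers:      # a failure sends every node back to the same checkpoint
    w.rollback()
steps = [w.state["step"] for w in workers]
```

Because all nodes commit together, they roll back to a mutually consistent state, and only the work done after the most recent checkpoint has to be recomputed.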
Publications
- A. Kongmunvattana, S. Tanchatchawal, and N.-F. Tzeng, Coherence-Based Coordinated Checkpointing for Software Distributed Shared Memory Systems,
Proc. 20th Int'l Conference on Distributed Computing Systems (ICDCS 2000), Apr. 2000, pp. 556-563.
- A. Kongmunvattana and N.-F. Tzeng, Logging and Recovery in Adaptive Software Distributed Shared Memory Systems,
Proc. 18th IEEE Symposium on Reliable Distributed Systems (SRDS), Oct. 1999, pp. 202-211.
- A. Kongmunvattana and N.-F. Tzeng, Coherence-Centric Logging and Recovery for Home-Based Software Distributed Shared Memory,
Proc. 1999 Int'l Conference on Parallel Processing (ICPP '99), Sept. 1999, pp. 274-281.
- A. Kongmunvattana and N.-F. Tzeng, Lazy Logging and Prefetch-Based Crash Recovery in Software Distributed Shared Memory Systems,
Proc. 13th Int'l Parallel Processing Symposium (IPPS '99), Apr. 1999, pp. 399-406.
- N.-F. Tzeng and A. Kongmunvattana, Distributed Shared Memory Systems with Improved Barrier Synchronization and Data Transfer,
Proc. 11th ACM Int'l Conference on Supercomputing (ICS), July 1997, pp. 148-155.
- S. S. Fu and N.-F. Tzeng, Aggressive Release Consistency for Software Distributed Shared Memory,
Proc. 17th Int'l Conference on Distributed Computing Systems, May 1997, pp. 288-295.
- S. S. Fu, N.-F. Tzeng, and Z. Li, Empirical Evaluation of Distributed Mutual Exclusion Algorithms,
Proc. 11th Int'l Parallel Processing Symposium, April 1997, pp. 255-259.
- S. S. Fu and N.-F. Tzeng, Lock Improvement Technique for Release Consistency in Distributed Shared Memory Systems,
Proc. 6th Symposium on the Frontiers of Massively Parallel Computation, October 1996, pp. 255-262.
Funding
- National Science Foundation under Grants MIP-9201308, CCR-9803505, and EIA-9871315.
- Board of Regents, State of Louisiana under Contract No. LEQSF(1998-99)-ENH-TR-101.
Send e-mail to:
tzeng@cacs.louisiana.edu