Clustermatic!

 

Redesigning the Cluster Architecture

A project of the Cluster Research Lab in the Advanced Computing Laboratory at Los Alamos National Laboratory

 


News Flash:
Los Alamos National Laboratory's ASCI Lightning runs Clustermatic!


It has become widely recognized that cluster setup and management are extremely tedious and error-prone, owing to the inherent autonomy of the nodes in a cluster and the scale now obtainable, soon to be in the thousands of nodes. For these same reasons, using a cluster is much more difficult than using a traditional supercomputer.

To attack these problems, we have redesigned the cluster architecture from low-level machine setup all the way to programming support. By replacing key components and adding vital functionality, we have increased reliability and efficiency, decreased autonomy, and brought forth new possibilities for cluster computing.

The new cluster architecture replaces the legacy booting mechanism with LinuxBIOS and runs an operating system that provides a single system image of the entire cluster (Linux with BProc) (Figure 1). Contrast this with the traditional cluster architecture, which is a loose coupling of many individual single-user workstations (Figure 2).

[Image: the new cluster architecture]

Figure 1. For our new cluster architecture, only the front end is a fully loaded system. The cluster nodes themselves have only LinuxBIOS installed. They receive the kernel (BProc + Linux) from the front end.


[Image: the traditional cluster architecture]

Figure 2. For a traditional cluster configuration, each node is a fully loaded independent system.


Clustermatic: A complete cluster solution

Clustermatic is a collection of new technologies being developed specifically for our new cluster architecture. Each technology can be used separately, and thus does not preclude integration with other clustering efforts or even other types of computing environments. For example, BProc is being used in several production-grade clusters; LinuxBIOS is being sold in products such as web content caching appliances, DVD players, and Fibre Channel analyzers.

LinuxBIOS

LinuxBIOS replaces the normal BIOS bootstrap mechanism with a Linux kernel that can be booted from a cold start. Cluster nodes can now be as simple as they need to be -- perhaps as simple as a CPU and memory, with no disk, no floppy, and no file system (though LinuxBIOS does not preclude these things). As a side effect, they are up and running in under 3 seconds. More importantly, since the nodes are under the control of the operating system from power-on, we have complete control over what happens next. And what happens next is that they contact a special node and reboot into a more sophisticated kernel.

BProc

The Beowulf Distributed Process Space (BProc) provides a single system image of the entire cluster. LinuxBIOS cluster nodes come up autonomously and contact the "front end" node which sends them a BProc kernel to boot and registers them as part of the cluster. Users run programs on the front end, which migrates the jobs to the other cluster nodes.

BProc itself consists of a small set of kernel modifications, utilities, and libraries that allow a user to start processes on other machines in a cluster (including rebooting them). Remote processes started with this mechanism appear in the process table of the front end. This allows remote process management using the normal UNIX process control facilities. Signals are transparently forwarded to remote processes, and exit status is received using the usual wait() mechanisms.
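As an illustration of what "normal UNIX process control" means here, the following minimal sketch uses only standard POSIX calls exposed by Python's `os` and `signal` modules. It does not use any BProc-specific interface and runs the child locally; the function name `run_and_signal` is hypothetical. The point is that, according to the description above, the same `kill()` and `waitpid()` pattern would apply from the front end to a process that BProc has placed on another node.

```python
import os
import signal


def run_and_signal():
    """Start a child, send it SIGTERM, and collect its status with waitpid().

    Everything here is plain POSIX process control on one machine. Under
    BProc the child could be running on another node (an assumption based
    on the text above; no BProc API is used in this sketch).
    """
    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: restore the default SIGTERM action, report readiness,
        # then idle until a signal arrives.
        try:
            os.close(ready_r)
            signal.signal(signal.SIGTERM, signal.SIG_DFL)
            os.write(ready_w, b"1")
            while True:
                signal.pause()
        finally:
            os._exit(1)  # only reached if something unexpected happens

    # Parent: wait until the child is ready, then signal it and reap it.
    os.close(ready_w)
    os.read(ready_r, 1)
    os.close(ready_r)
    os.kill(pid, signal.SIGTERM)
    _, status = os.waitpid(pid, 0)
    return pid, status


if __name__ == "__main__":
    pid, status = run_and_signal()
    if os.WIFSIGNALED(status):
        print(f"child {pid} terminated by signal {os.WTERMSIG(status)}")
```

Because the child appears in the front end's process table, tools built on these calls (shells, `ps`, `kill`) need no cluster-specific changes.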

Other work

Clusters with thousands of nodes will experience failures as frequently as every five minutes. Programs will need to be much more resilient and run through to completion despite failures. We are currently developing system-level services and application support for this run-through technology.

Supermon is a high-speed cluster monitoring tool that can collect 1000 samples per node per second without noticeable effect on the cluster nodes. The data from Supermon can be used to monitor node health and perform remote node maintenance. In addition, the monitoring information can be used to predict and react to node failures.

Guard is an interactive debugger designed to support debugging on clusters and other types of parallel architectures. In addition, Guard is the first implementation of the debugging paradigm known as relative debugging, a technique that allows a user to compare data between two executing programs. Relative debugging was devised to aid the testing and debugging of programs that are either modified in some way, or are ported to other computer platforms.
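The idea behind relative debugging can be sketched in a few lines: take the same data structure from a reference run and from a modified or ported run, and report where they disagree. The sketch below is not Guard's interface; the function name `compare_runs` and the tolerance parameters are assumptions chosen for illustration, and the two "runs" are plain lists rather than live programs.

```python
import math


def compare_runs(reference, candidate, rel_tol=1e-9, abs_tol=0.0):
    """Return (index, ref_value, cand_value) for every element that differs.

    `reference` and `candidate` stand in for the same array captured from
    two executing programs (hypothetical data, not Guard output).
    """
    if len(reference) != len(candidate):
        raise ValueError("arrays must have the same length")
    return [
        (i, r, c)
        for i, (r, c) in enumerate(zip(reference, candidate))
        if not math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


if __name__ == "__main__":
    ref = [1.0, 2.0, 3.0, 4.0]      # values from the original program
    ported = [1.0, 2.0, 3.5, 4.0]   # values from the ported program
    for i, r, c in compare_runs(ref, ported):
        print(f"element {i}: reference={r} candidate={c}")
```

A tolerance is used instead of exact equality because a port to another platform can legitimately change floating-point results in the last few bits.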

For application support, we have added automatic checkpointing in the ZPL compiler. ZPL is a high-level parallel programming language developed at the University of Washington. The compiler inserts checkpoint calls in the user's source code at places with a minimum number of live variables, greatly reducing the checkpoint size as compared to other systems that use the virtual memory system to checkpoint dirty pages. The compiler can also guarantee that there are no in-flight messages during the checkpoint; this eliminates the need for message logging for recovery.
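The placement rule can be illustrated with a small sketch: given several candidate program points and the variables that are live at each, pick the point where the least data would have to be saved. This is not the ZPL compiler's algorithm. The site names, variable names, and sizes are hypothetical, and the sketch measures live data in bytes rather than counting variables, which is an assumption.

```python
def live_bytes(site, sizes):
    """Total size in bytes of the variables live at a candidate site."""
    return sum(sizes[name] for name in site["live"])


def pick_checkpoint_site(sites, sizes):
    """Return the candidate site with the least live data to save."""
    return min(sites, key=lambda site: live_bytes(site, sizes))


if __name__ == "__main__":
    # Hypothetical variable sizes in bytes.
    sizes = {"A": 8000, "B": 8000, "tmp": 8000, "n": 8}
    # Hypothetical program points with the variables live at each one.
    sites = [
        {"name": "after_init", "live": {"A", "B", "n"}},
        {"name": "inside_loop", "live": {"A", "B", "tmp", "n"}},
        {"name": "after_reduce", "live": {"A", "n"}},
    ]
    best = pick_checkpoint_site(sites, sizes)
    print(best["name"], live_bytes(best, sizes))
```

A page-based checkpointer would save every dirty page regardless of whether its contents are still needed, which is why choosing a point with few live variables can shrink the checkpoint.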


Download the Clustermatic distribution CD:

Clustermatic 5 (November 2004)

This is an update of the Clustermatic software for Linux 2.6.x. This release also adds ppc64 (64-bit PowerPC) support.

LACSI '04 (October 2004)

This release is a slightly updated version of Clustermatic 4. It includes tutorial slides and examples in addition to the Clustermatic software.

Clustermatic 4 (November 2003)

This release adds AMD64 support.

LACSI '03 (October 2003)
Clustermatic 3 (November 2002)
Clustermatic 2 (March 2002) now with PowerPC support!
Clustermatic 1 (Fall 2001)

Other software and links:

GM Route - a fast replacement GM network mapper
  • Fast Mapping on Myrinet Networks - a paper describing the mapping algorithm used by gm_route. This paper appeared in the Myrinet workshop at The 7th International Conference on High Performance Computing and Grid in Asia Pacific Region.
  • gm_route-1.1
    Changes from 1.0 to 1.1:
    • Added a beoboot node setup plugin version of gm_storeroute.
    • Fixed rc.gm_manager and install scripts so that this will work on SuSE as well as Red Hat Linux.
    • Fixed misc build issues on AMD64.
  • gm_route-1.0
    First release.


Clustermatic projects are sponsored in part by the Department of Energy's Office of Science and the ASCI Institutes program. The Cluster Research Lab performs fundamental research in operating systems and clustering. For more information, contact us at: clustermatic at lanl.gov.


© 1999-2003 Clustermatic