Current multicomputers can be classified as coarse-grain systems in which tens to hundreds of processors are relatively loosely coupled, communicating using (optimally) large messages. At one end of this grouping are workstation clusters and machines like the IBM SP-2, which run a complete operating system on each node; at the other end are machines like the Cray T3D, Intel Delta/Paragon, or KSR (which can also be programmed using a message-passing model) which are more tightly coupled and use a distributed OS or a simple runtime system.
A major shift of emphasis in architectural design for parallel computers is occurring, however. New fine-grain multicomputers built from thousands of processors, coupled using a low-latency, small, active message paradigm, show great promise in improving the capability of parallel systems. Their primary advantages lie in the ability to scale well to orders of magnitude more processors than existing machines, the low overhead cost of the active-message communications, and the ease of implementation of the object-oriented paradigm that is becoming the fundamental basis of software technology. This new class of machines is exemplified by the MIT J-Machine prototype.
It is not yet clear how to use these new machines for non-trivial applications, how best to program them, or how best to compile programs for them. Previous work in this area has led to the development of prototype programming systems for the J-Machine and portable programming abstractions that can be used to implement a wide variety of irregular concurrent algorithms. These methods are now at the point where they can be applied to real-world problems in materials simulation.
The computational routines developed from the KSR code have been incorporated into a second code currently running on the Caltech/CSC Intel Touchstone Delta and Paragon XP/S machines, and potentially portable to the JPL Cray T3D. This code uses a new communications architecture based on a portable active message library, and thus also compiles on the prototype MIT J-Machine. It is suitable for calculations on very large-scale systems of up to 20 million atoms. The code also serves as a testbed for flexible optimization strategies designed to give efficient performance on traditional coarse-grain as well as fine-grain processors despite the highly asynchronous, multithreaded nature of the communications strategy.
Note that the cell centers need to be sent up the tree before the multipoles because each parent cell needs to determine what its center is before it can process incoming multipoles from its children.
Additional message types are used for initialization, synchronization, flow control, and atom reassignment.
Currently, the active message model is the preferred model for programming experimental fine-grain parallel processing hardware such as the MIT J-Machine and M-Machine [1,2,3,4,5], which are among the targeted platforms for this code.
As an example, consider the second type of message from the previous section, which communicates multipoles from a cell to its parent. This is implemented as an active message of type CHILD, with the destination node, the destination (parent) cell, and the sending cell's multipoles as arguments. When such a message is received on the destination node, it triggers invocation of the child() routine, which parses the arguments, obtains the appropriate destination cell, and calls a purely computational routine, identical to the KSR code's corresponding routine, to combine the child's multipoles with the parent's.
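In outline, the handler side of this exchange might look like the following C sketch. The cell structure, table sizes, and the lookup_cell() and combine_multipoles() helpers are illustrative assumptions, not the code's actual definitions:

```c
#define NMULT  8    /* multipole coefficients per cell (assumed count) */
#define NCELLS 16   /* size of the local cell table (illustrative)    */

typedef struct {
    double mult[NMULT];   /* accumulated multipole coefficients */
} Cell;

Cell cells[NCELLS];       /* local cell table (illustrative) */

/* Hypothetical lookup from a cell identifier to the local table. */
static Cell *lookup_cell(int cell_id)
{
    return &cells[cell_id];
}

/* Purely computational routine, standing in for the KSR code's
 * corresponding routine: fold a child's multipoles into the parent's. */
static void combine_multipoles(Cell *parent, const double *child_mult)
{
    for (int i = 0; i < NMULT; i++)
        parent->mult[i] += child_mult[i];
}

/* Invoked when a CHILD active message arrives: obtain the destination
 * (parent) cell and combine the child's multipoles into it. */
void child(int parent_cell_id, const double *child_mult)
{
    combine_multipoles(lookup_cell(parent_cell_id), child_mult);
}
```

In the real code the arguments arrive packed in the message body and must be parsed out; here they are passed directly for clarity.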
Although active messages have not been used extensively on traditional multicomputers such as Intel's Delta and Paragon [6,7], we have found that refinements can be added to improve performance on such architectures, primarily by gathering multiple small messages into a few large ones. This buffering significantly reduces per-byte overhead; when it was implemented, we observed improvements in wall-clock time by factors of 2 to 5 for a five-million-atom case.
Since the fundamental messaging primitives have no flow control, we implemented a flow control system on top of the active messages. Each processor is allowed to send a certain number of messages (a window) to each other processor in the system. When the window is full, it must wait until it has received acknowledgements before sending more messages. Keeping the windows on a per-processor basis allows more messages to be outstanding (not acknowledged) than if a single window were used on each processor. This in turn reduces the amount of nonproductive busy-waiting on each processor.
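A per-destination window can be kept as simply as a counter array. The following is a minimal sketch under assumed names and a hypothetical window size:

```c
#define NPROCS 4    /* number of processors (illustrative)                    */
#define WINDOW 8    /* max unacknowledged messages per destination (assumed)  */

typedef struct {
    int outstanding[NPROCS];   /* unacked messages, one count per destination */
} FlowCtl;

/* A message may be sent to dest only while that window is not full. */
int can_send(const FlowCtl *fc, int dest)
{
    return fc->outstanding[dest] < WINDOW;
}

/* Record a send: one more message outstanding to dest. */
void note_send(FlowCtl *fc, int dest)
{
    fc->outstanding[dest]++;
}

/* An acknowledgement (possibly piggybacked on data) reopens the window. */
void note_ack(FlowCtl *fc, int src)
{
    fc->outstanding[src]--;
}
```

Because the windows are kept per destination, a full window for one processor does not block sends to any other, which is precisely the property that reduces busy-waiting.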
The acknowledgement can often be packed with message data needed to further the calculation. In addition, we could optimize further by sizing buffered messages to the system message buffer, whose size is indicated at program invocation. The buffering strategy we use not only handles flow control on the underlying message system but also reduces overall latency costs by sending fewer, larger buffers rather than many smaller ones.
There are two possible messaging strategies: a "pull" strategy, in which data is requested from another processor, and a "push" strategy, in which data is sent directly to a destination processor. The "pull" strategy requires two messages per data transfer, while the "push" strategy can be optimized to use only one message per transfer when empty acknowledgements are not required. The expected savings led us to adopt the "push" strategy.
Portability across a wide range of message-passing machines, including the experimental MIT J-Machine, was desired. To achieve this goal, the message-passing code is organized as a set of procedures executed via what amount to remote procedure calls. The main program is simply a dispatcher that executes the appropriate procedure for a given message type, with arguments obtained from the message data.
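The dispatcher pattern reduces to a loop over incoming messages and a switch on a type field. The message layout and handler names below are illustrative stand-ins, not the code's actual definitions:

```c
/* Two message types as illustrative stand-ins; the real code has more
 * (initialization, synchronization, flow control, atom reassignment). */
typedef enum { MSG_CHILD, MSG_CENTER } MsgType;

typedef struct {
    MsgType type;        /* selects the procedure to execute */
    int     dest_cell;   /* destination cell identifier      */
    double  args[8];     /* remaining procedure arguments    */
} Msg;

int child_calls, center_calls;   /* stub handlers just count invocations */

static void child_handler(int cell, const double *a)  { (void)cell; (void)a; child_calls++;  }
static void center_handler(int cell, const double *a) { (void)cell; (void)a; center_calls++; }

/* The main program amounts to repeating this step: execute the procedure
 * named by the message type, with arguments taken from the message data. */
void dispatch(const Msg *m)
{
    switch (m->type) {
    case MSG_CHILD:  child_handler(m->dest_cell, m->args);  break;
    case MSG_CENTER: center_handler(m->dest_cell, m->args); break;
    }
}
```

Organizing the code this way keeps the machine-specific part (how a message is received) separate from the procedures themselves, which is what makes the same procedure set portable across NX and the J-Machine's native messaging.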
The active message strategy was implemented on top of the Intel-provided messaging system, NX.
One additional optimization can be performed within the context of the CMM. Often, the same data (such as a cell's multipoles or its atoms) must be sent to multiple neighboring cells that are not on the local node. In many cases, all of those remote cells will be assigned to the same CPU. In that event, we can eliminate the redundant data transmissions by sending only one copy of the data and specifying multiple cells as destinations. Such redundancy removal can provide significant performance gains.
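The grouping step can be sketched as follows (structure and function names are hypothetical): destination cells are bucketed by owning node, so the shared data travels once per node rather than once per cell.

```c
#define MAXDEST 32   /* bound on destination cells per send (assumed) */

/* One outgoing copy of the shared data: the owner node plus every
 * destination cell assigned to that node. */
typedef struct {
    int node;
    int ncells;
    int cells[MAXDEST];
} Send;

/* Group destination cells by owning node; returns the number of sends
 * needed, i.e. the number of distinct nodes rather than of cells. */
int group_by_node(const int *cell, const int *node, int n, Send *out)
{
    int nsend = 0;
    for (int i = 0; i < n; i++) {
        int j;
        for (j = 0; j < nsend; j++)
            if (out[j].node == node[i])
                break;
        if (j == nsend) {               /* first cell seen on this node */
            out[nsend].node = node[i];
            out[nsend].ncells = 0;
            nsend++;
        }
        out[j].cells[out[j].ncells++] = cell[i];
    }
    return nsend;
}
```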
Since large systems of atoms can span hundreds of megabytes per input file per processor, we found that this approach stressed the relatively small number of I/O nodes handling the disks on the Intel Paragon and Delta. When each processor opens its own file, the read requests funnel through very few I/O nodes (16 on the 512-compute-node Paragon we used for our timings). The I/O nodes become overloaded retrieving information spread across the disks, so we must throttle our requests. Another approach would have been to reorganize the single, unified structure file so that each processor could seek to its own place in the file and read. This would have allowed us to take advantage of the Intel-optimized global read/write calls, but would have moved away from the single reader/writer mode which the J-Machine prefers. Having all the processors read all the atoms from the structure file, discarding those that are not local, is probably too expensive for the multi-gigabyte data sets needed for very large systems.
Output is again similar to the KSR code and includes the potential and kinetic energy at each timestep, as well as other parameters of the dynamics. At user-specified intervals, snapshots of the system are taken containing positions, velocities, and forces on each atom. Each snapshot is composed of one file written from each CPU.
We found that many hours of production runs not only needed the flexibility of restarting, due to machine crashes and varying scheduling policies, but would have also benefited from resuming on a different number of processors than the original run. Since the startup and checkpoint files were written on a per-processor basis, we would have had to reassign cells and atoms (and rebalance the load) if the number of processors changed. This was infeasible given the current I/O architecture; use of a unified input file might make this easier.
Although the calculation per timestep might take longer on smaller numbers of processors, available CPU time for small-to-medium processor counts is often much greater than similar blocks of time for large numbers of processors. Thus, time to solution for systems of atoms that would fit on various-sized partitions could have been shortened by flexible restart capabilities.
Since cells on different nodes often need to interact, we need a method for determining which node a given cell at a given level is on. Since each processor is assigned a consecutive range of leaf cells, and since there is a fixed rule for assigning parent cells, all that is required is that each node have a copy of the table describing the leaf cell range limits for all the nodes. A simple binary search through this table (at most 9 steps for 512 nodes) produces the number of the node containing a desired cell.
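A sketch of that lookup, assuming `range[i]` holds the first leaf cell owned by node `i` (the exact table layout in the code may differ):

```c
/* range[i] = first leaf cell assigned to node i, with nodes owning
 * consecutive ranges of leaf cells.  Binary search finds the owner of
 * a leaf cell in at most log2(nnodes) steps (9 for 512 nodes). */
int node_of_cell(const int *range, int nnodes, int cell)
{
    int lo = 0, hi = nnodes - 1;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;   /* round up so lo always advances */
        if (range[mid] <= cell)
            lo = mid;                  /* owner is at mid or above */
        else
            hi = mid - 1;              /* owner is below mid */
    }
    return lo;
}
```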
While the nonbond calculations do not depend explicitly on the identity of atoms within the cells, the bonded calculations must do so to maintain the correct molecular topology. This requires that a processor calculating a bonded interaction know the coordinates of the atoms participating in that interaction. On the KSR, obtaining data from non-local atoms can be left to the memory system. On message-passing machines, we make the assumption that all bonded interactions only involve atoms in the same leaf cell or the nearest-neighbor leaf cells. Since the coordinates of the atoms in those cells need to be communicated anyway for the nearfield interaction computation, all we must do is save them until the valence calculations have been completed.
This portion of the code is just now being implemented; the current version stores all atom data in hash tables, taking care to keep data from remote nodes and data from the local node distinct. After each nearfield calculation, all of the data required for the valence calculations may be found in the hash tables. After the valence portion of the code, the remote atoms may be removed.
On the KSR, each processor may update the forces for non-local atoms as long as locking is used to prevent simultaneous updates by multiple processors. The message-passing code updates the partial forces in the local copies of remote atoms and then communicates these partial sums back to the "home" node after all of the bonded calculations have been performed. No locking is needed since only the local processor can access the atoms' memory.
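On the message-passing side, the home-node update amounts to a handler like the following (array size and names are illustrative). No lock is needed because only the home processor touches its atoms' force storage:

```c
#define NATOMS 8   /* locally owned atoms (illustrative) */

typedef struct {
    double f[3];   /* accumulated force components */
} Force;

Force home_force[NATOMS];   /* forces on this node's own atoms */

/* Handler for an arriving partial-force message: add the remote partial
 * sum into the home node's copy.  Serial execution of handlers on the
 * owning node replaces the locking needed on the KSR. */
void add_partial_force(int atom_id, const double partial[3])
{
    for (int i = 0; i < 3; i++)
        home_force[atom_id].f[i] += partial[i];
}
```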