## Chapter 5. Implementation on KSR

### 5.1. Introduction

Numerous other groups have developed parallel molecular dynamics codes [1,2,3,4,5,6,7,8,9,10,11,12,13]. All of these codes suffer from various limitations, however. Some [1,2,3,4] limit the range of the forces, keeping the number of interactions low and thereby minimizing communication, but eliminating the possibility of handling interesting chemical systems which rely upon the long-range Coulomb interaction. Others [5,6,7] use replicated data methods, in which a copy of each atom is stored on each processor. Such codes are obviously unusable for large-scale systems which may not fit on a single CPU. Still others [8,9,10,11] use a ring for communication, which allows every atom to interact with every other atom, but also requires that half the memory of each CPU be reserved for incoming atoms and that every atom be sent to every CPU, maximizing communication. Kalia [12,13] has developed codes that handle nonbonded forces well but require special three-body potentials instead of standard valence forcefield terms. We set out to develop a code that would be parallel, distributed-memory (limited only by the total memory in the system), large-scale (able to handle systems of millions of atoms), and general-purpose (accepting standard forcefields). The heart of such a code is the nonbonded energy calculation, which is performed by the CMM.

The primary parallel implementation of the CMM algorithm has been performed on the Caltech Materials and Process Simulation Center's KSR-1 parallel supercomputer using the C language. This code uses the KSR's shared-memory programming model and is robust, efficient, and modular. It has been used for production calculations on systems of up to 2 million atoms on the MSC 64-CPU, 2-gigabyte-memory machine and on systems of up to 4 million atoms on the Cornell Theory Center 128-CPU, 4-gigabyte-memory machine, with the main limitation being memory capacity. The program has been structured to permit the easy addition of new science (e.g. new forcefield terms and new integration methods), while still remaining efficient for large-scale computations.

### 5.2. KSR Architecture

The KSR AllCache architecture [14] presents a shared memory programming model to the user, even though it is implemented on top of a physically distributed memory. This model greatly simplifies the initial parallel programming task and, when attention is paid to the actual location of data, can still provide high parallel efficiency without requiring massive rewriting of code.

Two key features of the KSR architecture are used in the code. First, all data is stored in a single global address space. This means that data not specified as private is automatically sharable by all processors, merely by accessing a global variable or dereferencing a pointer. Second, any 128-byte "subpage" (the basic quantum of memory sharing in the machine) may be locked atomically by a processor so that only it retains access rights to the subpage. Any other processor attempting to lock (or even reference) the subpage must wait until the locking processor explicitly releases it. Exclusive lock primitives are commonly provided by parallel programming systems, but the KSR version is substantially more powerful because it is associated with 128 bytes of data. In particular, the cost of a lock is small in terms of execution time and zero in terms of memory usage (provided that the data being locked occupies at least 128 bytes), significantly cheaper than most implementations.

An additional feature provided by the KSR architecture is a fast reciprocal square root approximation instruction. The KSR FORTRAN compiler issues this instruction, followed by two Newton-Raphson steps to ensure accuracy, when computing x^-1/2. This is much more efficient than calling a library function to determine the square root and then performing a reciprocal operation. Since the C compiler does not make the hardware instruction accessible and does not properly optimize an explicit expression for the function, inlined code generated by the FORTRAN compiler is used.

The caching strategy of the machine allows data to be passed from CPU to CPU merely by referencing it. As long as the new CPU continues to use the data, it will remain in its cache. The copy of the data in the old CPU will be flushed when necessary or when the data is modified. Because of the global address space, no bookkeeping needs to be done in the software to keep track of where a given piece of data is. This contrasts with message-passing architectures, in which data must explicitly be sent from one CPU to another, and it is often necessary to maintain elaborate structures to identify the location of a desired piece of data.

### 5.3. Memory

In order to handle large-scale systems of millions of atoms, the amount of memory used per atom must be kept to a minimum. With existing machines' capacities ranging up to a few tens of gigabytes, maintaining an average memory usage of about one kilobyte per atom is essential for the simulation of million-atom systems. The KSR implementation uses 128 bytes (one subpage) to store the most important information about each atom and additional memory of up to about 800 bytes per atom to store information about atomic connectivities, cell multipole and Taylor series coefficients, and valence forcefield information.

### 5.4. Data Structures

Critical atom data is stored in a single subpage for maximum efficiency. In addition, the most frequently used data is stored in the first 64 bytes of the subpage, or subblock, to enable it to be more efficiently cached in the onboard processor cache.

The cell data structure contains the position of the cell's center (or centroid), its multipoles, and its Taylor series coefficients. In addition, if the cell is a leaf cell, a pointer to the cell's atoms is maintained. The "done" variable is used as a validity or "presence" tag to help avoid explicit global synchronization (see section 5.11).

Atoms are stored in a globally-accessible array so that they may easily be retrieved by atom number. This is particularly useful in computing the valence forcefield portion of the energy, where each interaction is specified only in terms of the numbers of the atoms participating. The atoms are simultaneously linked into lists attached to the appropriate leaf cells to allow easy cell-by-cell handling in the CMM portion of the code.

Since each atom occupies one subpage, the atom array will be distributed element-by-element across the processors. A CPU performing computations using an atom's data will tend to maintain a copy of that data in its cache.

Pointers to cells are stored in globally-accessible arrays, one for each level. This allows a cell to be accessed easily given a pair of integers representing the level and cell number. Cells without atoms have no memory allocated for their data, but they do have pointer array elements (set to null) associated with them. As the number of levels increases, these arrays grow rapidly in size. For some systems containing significant amounts of open space, many of the cells will remain unoccupied, and it may be worthwhile to convert to a hash table or other sparse vector implementation rather than the explicit array. At levels up to about 7 (2^21 = 2 million cells), however, the overhead of keeping empty elements in the array is outweighed by the speed of access.

Iterator functions are used to step through all atoms or all leaf cells. Each such iterator is first initialized, and then the iterator returns a new atom or cell at each call. The end of the list of cells is indicated by returning a null pointer.
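The iterator pattern might look like the following sketch, assuming illustrative names (Cell, leaf_iter_init, leaf_iter_next) and a simple array-backed leaf level rather than the code's actual data structures:

```c
#include <stddef.h>

/* Sketch of the leaf-cell iterator described above: initialize once,
 * then each call returns the next occupied leaf cell, with NULL
 * marking the end of the list.  Unoccupied cells (null pointer array
 * elements) are skipped transparently. */
typedef struct Cell { int number; } Cell;

static Cell **leaf_cells;    /* pointer array for the leaf level */
static int    n_leaf_cells;  /* length of that array             */
static int    iter_pos;      /* iterator state                   */

void leaf_iter_init(Cell **cells, int n)
{
    leaf_cells = cells;
    n_leaf_cells = n;
    iter_pos = 0;
}

Cell *leaf_iter_next(void)
{
    while (iter_pos < n_leaf_cells) {
        Cell *c = leaf_cells[iter_pos++];
        if (c != NULL)       /* skip unoccupied cells */
            return c;
    }
    return NULL;             /* end of the list */
}
```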

Connectivity is stored in an array of six elements for each atom. Each element is composed of an atom number, a translation to a neighboring unit cell, if necessary, and a bond type index used for computing bond energies.

Valence forcefield entries are stored in another list associated with each atom containing the interaction type, the atom numbers (and possibly translations) used in the interaction, and a type index from which the forcefield parameters can be obtained.

Both the connectivity arrays and the valence forcefield entries are globally accessible. This simplifies checking for connectivities of neighboring atoms and allows slightly less complex updating when an atom moves across unit cell boundaries.

### 5.5. Modularity

Most parameters to the program are specified in a control file containing commands. The control file input routine is designed to accept one command per line, switch based on the first word of the command, and pass the remainder of the command to option routines associated with each module to further parse keywords. This allows the control file reader to be essentially isolated from the state variables maintained by each module.

The forcefield parameters are specified in a separate file, the format of which is defined by BIOGRAF [15]. The forcefield file input routine makes two passes through the file. In the first pass, the number of entries in each section of the file is counted; section dividers trigger calls to specific readers that understand the format of each section. In the second pass, memory is allocated for each forcefield parameter array, and the data parsed from the file is stored in it.

Different functions may be used for different interactions in the forcefield. When a set of forcefield parameters has been read from the file, it is passed to a routine that looks up the function type in a dictionary and calls the appropriate setup routine to store the data, after any preprocessing, in the forcefield parameter array. A pointer to the routine that actually computes the function is also stored. The use of the dictionary and function pointers allows additional functions to be added easily.

The overall sequence of events within a timestep is controlled by a single routine (real_nodes). Additional computational routines may be added, with calls placed in the master routine at appropriate places during the timestep. It is hoped that eventually this master routine can be replaced by an embedded language interpreter (such as Tcl or Perl) to allow even greater flexibility in the use of computational modules and analysis tools.

Integrators are also modular. Each integrator is called through a well-defined interface from the master sequencing routine. This interface is a list of pointers to functions provided by the integrator module. In addition, functions must be provided to save and restore any state information required by the integrator, and some additional functions used by the rigid molecule code are also required. This technique allows any integration method that can be structured to fit the (very general) interface to be used, and the integrator may keep any amount of private state information. Additional integrators may be added by merely defining the interface and providing a command to choose the appropriate function pointer list.

Atoms and cells could easily be reworked into C++ objects. Although many of their data elements are "public," most manipulations of these data structures are handled through access functions. As an example, atomic forces may only be updated through one of two calls: a version that locks the atom to prevent access by any other CPU, performs the update, and then unlocks the atom, or a version that avoids the locking. The latter version (which is actually inlined for efficiency) is used primarily in the CMM code, where the structure of the algorithm makes it impossible for two CPUs to be attempting to update the same atom. Nevertheless, it could be replaced by the locking version, perhaps for debugging purposes, without modification to the rest of the code.
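A portable sketch of the locking version of the force update follows, with a pthread mutex standing in for the KSR subpage lock and illustrative identifiers throughout (on the KSR the lock is the 128-byte subpage itself, so it costs no extra memory; the real Atom record is padded to fill exactly one subpage):

```c
#include <pthread.h>

/* Illustrative atom record and locked force update.  The mutex is a
 * portable stand-in for atomically locking the atom's subpage. */
typedef struct {
    double x, y, z;           /* position          */
    double fx, fy, fz;        /* accumulated force */
    pthread_mutex_t lock;     /* stand-in for the hardware subpage lock */
} Atom;

void atom_add_force_locked(Atom *a, double fx, double fy, double fz)
{
    pthread_mutex_lock(&a->lock);   /* exclude all other CPUs */
    a->fx += fx;
    a->fy += fy;
    a->fz += fz;
    pthread_mutex_unlock(&a->lock); /* explicit release, as on the KSR */
}
```

The unlocked variant would omit the two mutex calls (and be inlined); as noted above, the two are interchangeable from the caller's point of view.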

### 5.6. Features

Even a fast, efficient code is useless unless it can solve problems of interest to chemists. To that end, the code supports general forcefields, including all normal valence terms, cross terms, and variant functional forms for each type of term. The commonly-used DREIDING [16] and AMBER [17,18] forcefields may be used (except for DREIDING hydrogen bond terms). Many additional features have been implemented to increase the problem domain that can be handled by the code.

#### 5.6.1. Coordinate Transformations

Two types of coordinate transformations need to be performed in the code when periodic boundary conditions are used. The first type is necessary when noncubic unit cells are used. As described in Chapter 4, coordinates in the unit cell in real space are transformed to lie within a unit cube. The reverse transformation is also required, as are utilities for slightly more complex transformations related to reciprocal-space vectors used in the RCMM.

The second type of coordinate transformation occurs when a bonded interaction spans a unit cell boundary. All atomic coordinates are kept within the unit cell by the code. If an atom moves across a unit cell boundary, it is mapped to the image position on the other side of the unit cell. As a result, each atom participating in a valence interaction crossing a cell boundary may need to be remapped into its image position outside the unit cell before the interaction energy is computed. Such displacements are stored with the atom number in the bond and valence force field structures; the appropriate corrections are made to the atomic coordinate values within the various interaction computation routines through a standard utility function.

These displacements are always to neighboring unit cells. The Cartesian coordinate adjustments for each of the 26 neighbors can be precomputed, resulting in a very fast transformation routine that merely performs three additions. Future versions of the code with variable unit cell parameters (e.g. for NPT dynamics) will require recomputation of these coordinate adjustments whenever the cell changes.
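The precomputed table might be sketched as follows, assuming for simplicity an orthorhombic unit cell with edge lengths hx, hy, hz (illustrative names); one Cartesian offset is stored per neighbor, so applying a displacement costs exactly three additions:

```c
/* Precomputed Cartesian offsets for the 26 neighboring unit cells
 * (plus the identity), indexed by per-axis shifts in {-1, 0, +1}. */
static double disp[27][3];   /* [9*(sx+1) + 3*(sy+1) + (sz+1)][xyz] */

void init_displacements(double hx, double hy, double hz)
{
    int sx, sy, sz;
    for (sx = -1; sx <= 1; sx++)
        for (sy = -1; sy <= 1; sy++)
            for (sz = -1; sz <= 1; sz++) {
                double *d = disp[9 * (sx + 1) + 3 * (sy + 1) + (sz + 1)];
                d[0] = sx * hx;
                d[1] = sy * hy;
                d[2] = sz * hz;
            }
}

/* The fast transformation routine: merely three additions. */
void apply_displacement(double r[3], int code)
{
    r[0] += disp[code][0];
    r[1] += disp[code][1];
    r[2] += disp[code][2];
}
```

For NPT dynamics, only init_displacements would need to be rerun when the cell parameters change.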

#### 5.6.2. Moving Atoms

During the course of the dynamics simulation, atoms will move throughout the simulation volume. At intervals, a scan through all the atoms is performed, and any atom outside its assigned cell is collected for reassignment to a new cell. If the new cell does not exist, it will be created, and if the old cell is left empty, it will be deleted. Note that the farfield must be recalculated after this reassignment process to ensure that it is consistent with the new atom locations.

At the same time, atoms outside the unit cell for PBC systems are remapped to image locations within the unit cell.

#### 5.6.3. PBC Displacement Updating

When atoms move across unit cell boundaries in PBC calculations, the displacements associated with them and the atoms with which they participate in bonded interactions must be updated. Two routines are provided that scan through all the bond and other valence force field interactions of a moving atom and modify the displacements to reflect the new location of the atom. The displacement encoding (an integer from -13 to 13) is chosen so as to allow displacements to be simply added, provided that no displacement extends beyond the immediate neighbor unit cells.
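One encoding with the stated properties (illustrative; the thesis does not spell out the exact formula) treats the code as a balanced base-3 number, 9*sx + 3*sy + sz with each per-axis shift in {-1, 0, +1}. This indeed ranges over [-13, 13], and two codes add componentwise as long as no combined shift leaves the immediate neighbor cells:

```c
/* Balanced base-3 displacement encoding: each per-axis shift is in
 * {-1, 0, +1}, giving codes in [-13, 13] that can simply be added
 * while every component stays within the neighbor cells. */
int encode_disp(int sx, int sy, int sz)
{
    return 9 * sx + 3 * sy + sz;
}

void decode_disp(int code, int *sx, int *sy, int *sz)
{
    /* round-to-nearest division recovers the balanced base-3 digits */
    *sx = (code + (code < 0 ? -4 : 4)) / 9;
    code -= 9 * *sx;
    *sy = (code + (code < 0 ? -1 : 1)) / 3;
    *sz = code - 3 * *sy;
}
```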

#### 5.6.4. Atom Tracking

During, for example, diffusion calculations, it is important to know how far in real space a given atom has moved since the beginning of the simulation, despite remappings due to periodic boundary conditions. Since the displacements above only maintain relative information, it is not possible to use them alone to determine coordinates in the initial frame of reference. A facility has been added to the code to designate certain atoms as being tracked. The state (coordinates, velocities, and forces) of the tracked atoms may be written out more frequently than that of the other atoms in the system, reducing the amount of data that needs to be saved. In addition, absolute displacements from the original frame of reference in each of the Cartesian coordinate directions are maintained for the tracked atoms, allowing recovery of the true distance moved by an atom.

#### 5.6.5. General Valence Forcefield

All bonded interactions are calculated in parallel by iterating through the list of atoms on each processor. Bond interactions are calculated twice (from the point of view of each atom participating in the bond); all other interactions are calculated only once, with the results communicated to the appropriate atoms. To date, this portion of the calculation has not been highly optimized. In particular, the amount of communication could be reduced by performing all calculations for each non-local atom at one time.

To set up the valence force field interactions, it is necessary to determine which interactions are to be computed and to assign each interaction a type from the forcefield. This type is just a key specifying the appropriate function and set of parameter values to be used when calculating the energy and forces due to the interaction.

Selecting interactions is a relatively simple process. From the input, we know the molecular connectivity: which atoms are bonded to which other atoms. Any atom that is bonded to two other atoms forms an angle. Any atom bonded to exactly three other atoms also forms an inversion center. Any angle plus an additional bond on one side forms a torsion. Some additional checking needs to be done to ensure that only one instance of each interaction is generated and that systems with loops (e.g. cyclopropane) are handled properly.

Once the atoms participating in an interaction have been identified, we can use their atomic types to determine the type of the interaction. To simplify this process, a set of trees, one each for bonds, angles, torsions, and inversions, is built from the forcefield file information. Each interaction is placed in a canonical order based on its atomic types. The first type is then used to choose a branch from the root of the appropriate tree; the second type selects a branch from the resulting node; etc. Additional branches are added to the trees to handle wildcards in the type specifications from the input file. The wildcard branches are only tried if there is no type-specific branch.

An additional tree is built for hydrogen bond (or off-diagonal nonbond) interactions, which are further described in section 5.6.8. Although the bond, angle, torsion, and inversion trees may be disposed of after the program is initialized, hydrogen bonds are not completely predetermined by the initial geometry and so this small additional tree must be retained throughout the simulation.

#### 5.6.6. Nonbond Exclusions

Many forcefields define the nonbond interactions so as to exclude any components stemming from atoms that are participating in bonded interactions. Such exclusions could be handled in the CMM nearfield computation step by checking each interaction to be calculated, but such checking would be quite expensive. In addition, if a bonded interaction spanned a leaf cell boundary, the most distant atoms in the interaction would interact only through the CMM farfield, not the nearfield.

To overcome these difficulties, we compute the CMM normally, with no checking for exclusions. We then compute the excluded interactions explicitly, subtracting them from the energy and forces computed by the CMM. The list of excluded interactions can be built once, since it depends on the molecular topology, which doesn't change as long as no bonds are broken or created. The calculation of the exclusions can occur in parallel, with each CPU handling the computations for the atoms that are in the cells it owns.

The excluded energy is often dominated by Coulomb and van der Waals repulsion terms at short ranges and can thus be quite large. The net energy is then the difference between two large numbers (the unexcluded energy and the excluded energy), which could potentially lead to loss of accuracy. We have found, however, that use of double-precision (64-bit) floating point numbers provides more than adequate accuracy for non-pathological cases. Since this precision is all that the KSR floating point unit provides, there is no loss in performance.

#### 5.6.7. Spline Nonbonds

The code also handles spline-type nonbonds for comparison with this older method. As in traditional programs, a list of all nonbond interactions to be computed is generated periodically during the simulation. To speed up the list generation process, the code uses the same cell structure generated for the CMM. The sizes of the cells are set to at least the length of the spline cutoff distance. Then any interaction that could potentially be included on the nonbond list must be between atoms in neighboring cells. Each CPU generates a nonbond list for the atoms it owns; each interaction is included twice. The lists are stored in a packed form, removing redundancies in the storage of the numbers of local atoms. Exclusion tests are performed while building the nonbond list; no separate excluded force is calculated in this case. A standard cubic spline is used to smooth the energy function at the cutoff radius.

#### 5.6.8. Hydrogen Bonds and Off-Diagonal Nonbonds

AMBER-type hydrogen bonds are supported through a general facility that allows special off-diagonal nonbond parameters and functions to be applied. As for the spline nonbonds, a list of potential off-diagonal interactions is generated. Like the nonbond exclusions, the CMM-computed interaction is first subtracted; the new off-diagonal energy and forces are then added in.

#### 5.6.9. Initial Velocities

The initial velocities of the atoms in the system are generated by sampling a random Gaussian distribution for each Cartesian component. The standard deviation of the Gaussian is determined by the simulation temperature. Once selected, the velocities are rescaled to ensure that the initial temperature exactly matches the desired one. The selection and rescaling process is simple to parallelize; each CPU computes initial velocities for the atoms that have been assigned to it.

Use of truly random velocities makes it difficult to debug the code, or methods built upon it, because no run can be repeated exactly. To overcome this, we need to preserve the seed used for the pseudorandom number generator. If we saved a seed for each CPU, however, we would not be able to execute the same run on a different number of CPUs.

The solution used was to determine the generator seed algorithmically, by computing a function of each atom's coordinates. The generator is reinitialized for each atom. Use of a function of the coordinates ensures that there is no dependency on the number of CPUs, the assignment of atoms to CPUs, or any other aspect of the parallelism.
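As a sketch, any deterministic function of the coordinates will do; here an FNV-1a hash over the coordinate bytes stands in for the actual function used in the code:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative per-atom seed: a deterministic hash of the atom's
 * coordinates, so the generated velocities are independent of the
 * number of CPUs and of the atom-to-CPU assignment.  (FNV-1a is an
 * assumed choice, not the thesis code's actual function.) */
uint32_t seed_from_coords(double x, double y, double z)
{
    double c[3] = { x, y, z };
    unsigned char bytes[sizeof c];
    uint32_t h = 2166136261u;          /* FNV-1a offset basis */
    size_t i;

    memcpy(bytes, c, sizeof c);
    for (i = 0; i < sizeof bytes; i++) {
        h ^= bytes[i];
        h *= 16777619u;                /* FNV-1a prime */
    }
    return h;
}
```

The generator is then reinitialized with this seed before drawing each atom's velocity components.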

There is, of course, an option for production runs to use true random numbers seeded by the time-of-day clock on each CPU.

#### 5.6.10. Rigid Molecules

Rigid molecules are handled with a quaternion method [19]. The quaternion represents the rotational state of the molecule. The forces applied to atoms in the molecule are decomposed into a center-of-mass translational force and a torque. The translational force is applied to the center of mass of the molecule, which can be integrated normally. The torque is used to update the angular momentum of the molecule, which is in turn used to generate the new rotational quaternion through an iterative procedure.

#### 5.6.11. Perturbation Thermodynamics

A limited capability to perform perturbation thermodynamics calculations has been implemented. The program allows individual atoms to be mapped from one atom type to another. Only the nonbonded parameters (van der Waals and charge) are affected. The value of \lambda, the percentage blend of the two atom types, may be specified. If two values of \lambda are given, the program evaluates energies with each value at each timestep and prints the resulting energy difference along with various statistical information. Only one value of \lambda is used to determine the forces that control the dynamics.

### 5.7. Input/Output

To ease the transition from workstation-based programs, the program reads BIOGRAF-format forcefield and structure files, with only minor modifications needed to simplify the code or support additional functionality.

Input data to the simulation includes a control file giving parameters for the dynamics, a forcefield file giving parameters for the various terms in the energy expression, and a structure file containing the atomic position, charge, and connectivity information. The input data set totals 80-120 bytes per atom (depending on the system's connectivity).

Output includes the potential and kinetic energy at each timestep, as well as other parameters of the dynamics, such as the total Hamiltonian. At user-specified intervals, snapshots of the system are taken containing positions, velocities, and forces on each atom. These allow analysis of the properties of the system (including evolution with time) and also serve the important purpose of allowing the simulation to be restarted. The output files contain 72 bytes per atom, plus a small header, and a possible trailer for periodic boundary condition systems.

### 5.8. Parallelization

The CMM was initially implemented on single-processor workstations. Transferring the code to a single processor of the KSR-1 was simple, but achieving an efficient parallel implementation required substantial work.

The primary method of parallelization used is domain decomposition. We partition the cells (at all levels) across the set of CPUs. Each CPU then computes all relevant information for the cells and atoms it has been assigned, communicating with other CPUs as necessary. The two major obstacles to peak efficiency are the amount of communication required and imbalances in the amount of computation required on each CPU.

### 5.9. Communications

Since the CMM decomposes space into an octree of cells, it is natural to decompose the MD data across the processors of a parallel machine in the same spatial manner. Each processor is then responsible for a volume of space, and communication is only required across the surface area of that volume.

Minimizing this surface area while distributing the data is highly desirable to keep communications costs low. To achieve this goal, each cell is assigned a number. We use an octree numbering system, in which a cell's number is equal to its parent's number multiplied by 8 plus an index varying from 0 to 7. This system ensures that consecutively-numbered cells at the same level are generally close to each other in space. In particular, any range of cell numbers tends to form one or two approximately-cubic domains. Assigning such a range of cell numbers to each CPU will then tend to keep the surface area associated with each CPU small. Although this may not be the ideal partitioning, it works well, even on highly irregular (non-cubic) systems and is simple to implement.

A cell's number can thus be represented by a sequence of octal (base 8) digits, each corresponding to the child index at a different level.

Within this numbering system, the numbers of a cell's children or parent can be computed using simple expressions:

    child_i = 8 * cell + i,   i = 0, ..., 7
    parent  = [cell / 8]                                          (1)

where the square brackets denote the greatest-integer function.
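These expressions translate directly into three-bit shifts, since a cell number is a string of octal digits:

```c
/* Child and parent cell numbers in the octree numbering system.
 * Multiplying or dividing by 8 is a three-bit shift. */
long child_number(long cell, int i)   /* i = 0..7 */
{
    return (cell << 3) | i;           /* 8 * cell + i */
}

long parent_number(long cell)
{
    return cell >> 3;                 /* [cell / 8] */
}
```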

Determining the number of a cell's neighbor in, say, the -x direction is slightly more complex. Conceptually, we wish to subtract one from the integer composed of the bits forming the x coordinate of the cell. These bits are the lowest-order bits of each octal digit (three-bit group). We can thus use the following C code to mask out the desired bits and perform the subtraction, with borrows rippling, if necessary, through the intervening (zeroed) bits, which are then restored:

        mask = 01111111111; /* octal: the x bit of each octal digit */
        x = cell & mask;
        if (x > 0) {
            /* the borrow ripples through the zeroed intervening bits;
               (cell & ~mask) restores the y and z bits afterwards */
            neighbor = (cell & ~mask) | ((x - 1) & mask);
        }
        else {
            return -1; /* no such neighbor */
        }


Similar code applies to the other directions (the mask need merely be shifted) and to the positive directions (requiring additions with carries rather than subtractions with borrowing).

### 5.10. Load Balancing

Because this implementation of the CMM does not store information for unoccupied cells, and because systems of interest are often irregularly shaped, load balancing is a particular problem with this code.

The simplest approach is to assign the same number of leaf cells to each CPU. This fails, however, since many cells may be empty. Particularly with regular cell numberings, assigning equal-sized ranges of cells to CPUs can often lead to some CPUs not having any atoms (and thus computations) at all.

To a first approximation, the computational time per timestep is dominated by the nearfield interactions, which in turn are dependent on the number of atoms (or occupied cells) assigned to each processor. Therefore, we arrange for each processor to be responsible for a consecutively-numbered range of cells containing no less than n_atoms/n_cpus atoms (except for the last CPU). The use of ranges keeps the surface area low, as mentioned above, and also limits the size of the tables needed to determine on which CPU a given cell resides to merely one integer per CPU. This approach led to satisfactory load balancing.
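The range assignment can be sketched as a single sweep over the leaf cells (names are illustrative; the last CPU absorbs the remainder, as noted above):

```c
/* Assign consecutively-numbered cell ranges to CPUs so that each
 * range holds at least n_atoms/n_cpus atoms.  One integer per CPU
 * (the first cell of its range) suffices to locate any cell's owner. */
void partition_cells(const int *atoms_per_cell, int n_cells,
                     int n_cpus, int n_atoms, int *range_start)
{
    int target = n_atoms / n_cpus;   /* atoms wanted per CPU */
    int cpu = 0, count = 0, cell;

    range_start[0] = 0;
    for (cell = 0; cell < n_cells && cpu < n_cpus - 1; cell++) {
        count += atoms_per_cell[cell];
        if (count >= target) {       /* close this CPU's range */
            range_start[++cpu] = cell + 1;
            count = 0;
        }
    }
    while (++cpu < n_cpus)           /* leftover CPUs get empty ranges */
        range_start[cpu] = n_cells;
}
```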

Atom movements may cause the load to become unbalanced again. Each cell can determine how many atoms it contains; a simple, rapid linear sweep through the CPUs then can readjust the cell ranges and communicate the cell and atom data to their new locations. The KSR architecture makes this last step trivial: data migrates to each processor's cache as it is referenced in the next timestep, so it is not necessary to communicate it ahead of time.

The final code uses an even more sophisticated approach. After the initial load balance using the above technique, the amount of time each CPU spends waiting at synchronization points is measured. CPUs with longer waiting times do not have enough work. At intervals in the simulation, each CPU compares its accumulated waiting time with the average value across all the CPUs. Those with shorter waiting times give up cells and atoms to those that have longer waiting times, in proportion to the ratio of the waiting time to the average. The proportionality constant is adjustable and was empirically chosen to be 2.

This more advanced repartitioning strategy led to significant performance improvements on irregular problems. See Chapter 7 for more details.

Upper levels in the tree are assigned so as to generally minimize the amount of communication required during tree traversals. A parent cell is assigned to the same CPU as its 0-th numbered child. Given the load-balancing-determined ranges of leaf cells on each CPU, the CPU containing a higher-level cell can be determined by a shift and a binary search.
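A sketch of the lookup, with illustrative names: shift the cell number left by three bits per level to reach its 0-th leaf descendant, then binary-search the per-CPU table of leaf-range starting numbers:

```c
/* Find the CPU owning a cell at any level.  Because a parent lives
 * with its 0-th child, shifting down to the 0-th leaf descendant and
 * searching the leaf ranges gives the owner at every level. */
int cpu_of_cell(long cell, int level, int leaf_level,
                const long *range_start, int n_cpus)
{
    /* 0-th descendant at the leaf level: append three zero bits per level */
    long leaf = cell << (3 * (leaf_level - level));

    /* binary search: largest cpu with range_start[cpu] <= leaf */
    int lo = 0, hi = n_cpus - 1;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (range_start[mid] <= leaf)
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}
```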

### 5.11. Avoiding Synchronization

Since the KSR architecture provides a shared memory programming model, we can avoid some synchronizations that would otherwise be required to maintain data dependencies by allowing processors to explicitly check for the availability of needed data. This is accomplished by placing a "volatile" variable in each cell data structure that is set to a different value depending on what portions of the cell data have been computed. Processors requiring cell data to proceed can check the variable and wait if the values they need have not yet been calculated. The structure of the algorithm guarantees that deadlock will never occur.
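The presence-tag technique can be sketched as follows (stage names, values, and the CellData layout are illustrative):

```c
/* Each cell's volatile "done" variable records the highest stage of
 * its data that has been computed; a consumer spins until the stage
 * it needs is ready, so no global barrier is required.  The CMM's
 * structure guarantees the producer eventually arrives, so the spin
 * cannot deadlock. */
enum { CELL_EMPTY = 0, CELL_MULTIPOLES = 1, CELL_TAYLOR = 2 };

typedef struct {
    volatile int done;        /* presence tag */
    double multipoles[8];     /* illustrative payload */
} CellData;

void wait_for_stage(const CellData *c, int stage)
{
    while (c->done < stage)
        ;                     /* spin until the producer publishes */
}
```

A producer computes a stage of the cell's data, then stores the new stage value into done as its last action, publishing the data to any waiting consumers.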

Avoiding global synchronizations, in which processors that may not yet have data dependencies nevertheless must wait for slower processors, substantially improves the efficiency and speed of the code. On large problems with sufficient numbers of cells and atoms per processor, almost no waiting for data dependencies occurs at the lower, most populous levels of the tree.

When global synchronizations do need to be performed, they are implemented using the KSR-supplied, POSIX-compatible barrier primitive.