## Chapter 7. Performance

### 7.1. Introduction

There are three important aspects to the performance of a program. First, determining the accuracy of the results is crucial. Second, the overall time to solution is a key criterion. Third, for parallel programs, the scalability of the code with respect to the size of the system and the number of processors is important.

The accuracy of the KSR and message-passing codes is identical, as they use the same computational routines. Their speeds and scalabilities are evaluated separately below.

### 7.2. Accuracy

There are two standards to compare against to determine the accuracy of the CMM. One is of course the exact nonbond calculation using all N^2 pairwise interactions. The other is the de facto standard method used for small systems, spline cutoff.

The spline cutoff method derives from the simplest possible way of reducing the scaling of the nonbond calculation: ignoring all interactions between atoms farther apart than a certain distance. Since this leads to a discontinuity in the energy function at the cutoff distance, a cubic spline function is used to smooth the energy in that region. Typical distances used for small systems are 8.0 Å for the inner radius of the spline, where the energy is equal to the unmodified energy, and 8.5 Å for the outer radius of the spline, where the energy has been reduced to zero.

Given a constant density of atoms, the number of atoms within the cutoff radius is approximately constant, thereby making the scaling of the spline cutoff method O(N).

The asymmetric unit of the human rhinovirus-14 protein coat was used as a test case. The structure was obtained from the Brookhaven Protein Data Bank (file 4RHV). The forcefield parameters and atomic charges were obtained from AMBER [1]. This system contains 8,530 atoms, including crystallographic water molecules.

Method               Energy       Error      Rel.  Max. Force     RMS Force
(kcal/mol)   (kcal/mol)  Error     Error         Error
(%)   (kcal/mol/Å)  (kcal/mol/Å)
=============    =============  ==========  =====  ============  ============
Exact            -1.44298x10^4
Spline-Cutoff    -1.39980x10^4    +431.8    +3.1%     40.17          7.06
CMM/center       -1.43682x10^4     +61.6    +0.4%      2.68          0.34
CMM/centroid     -1.44097x10^4     +20.1    +0.1%      1.99          0.28

Table 7-1. Accuracy of energy and force calculations.

Table 7-1 presents the energy and force results for the various methods. The exact calculation was performed using all 72,752,370 pairwise interactions in the system. The spline cutoff method used 8.0 Å and 8.5 Å inner and outer distances. The CMM used a 128 Å cube bounding box and a tree depth (maximum level) of 5, resulting in 3.9 atoms per occupied leaf cell. The multipole expansions were truncated at the level of quadrupoles.

The results show that for this representative system, the CMM outperforms the traditional spline-cutoff method by an order of magnitude in all categories. Of particular importance is the much smaller force error. Using cell centroids instead of geometric centers significantly improves the energy, with a small additional improvement to the force errors, at virtually no additional cost.

Finally, the CMM is also much faster than the spline cutoff method, even with its improved accuracy. On a single (KSR-1, 20 MHz) CPU, the test case required 84.2 sec to set up and 28.1 sec to calculate using spline cutoff; the CMM required only 14.3 sec to set up and 6.0 sec calculation time. Since both methods are approximately O(N), this ratio of times should scale to larger systems as well.

From a purely numerical standpoint, then, the CMM is far superior to spline-cutoff. There is one small disadvantage to using the CMM, however: it is not guaranteed to produce forces which satisfy Newton's Third Law. Since atoms that are distant from one another interact only through fields, and since those fields themselves are only approximate representations of the effects of groups of atoms, the forces generated may not be decomposable into a set of pairwise, equal and opposite forces. It is interesting to note that the Ewald sum commonly used for handling systems with periodic boundary conditions also uses fields and hence might produce non-Newtonian forces. The spline-cutoff method always deals with pairs of atoms and so must rigorously satisfy the Third Law.

Such errors in the forces have been observed to produce three effects. First, an artifactual net force and net torque may be applied to the system. Second, integration of the net force may lead to a net velocity, which can appear as a directional flow in the system. Third, errors in the velocities can contribute to the system's kinetic energy, which in turn affects the total Hamiltonian.

Code has been added to the program to remove all three of these effects, as desired by the user. Net translational forces are removed by subtracting a small corrective force vector equal to the net force divided by the number of atoms from each atom in the system. Similarly, net velocities can be removed by subtracting (mass-weighted) corrective velocity vectors from each atom. Finally, an experimental strategy for rescaling the system velocities to produce a rigorously conserved Hamiltonian was implemented.

Rather than instituting such ad hoc corrections, however, the user can also increase the accuracy of the computed forces by reducing the interval between farfield updates or by reducing the timestep of the simulation. We have found that a farfield update interval of 5 works well for most systems, with a reduction to 2 being necessary in a few cases. Performing farfield updates every step virtually eliminates the Hamiltonian drift in almost all cases.

### 7.3. Timing

Time to solution for the message-passing code is currently rather poor due to lack of optimization of various parts of the communication routines. A naive implementation of active messages on current hardware produces many small messages, each with considerable latency (ironically, the exact opposite of the intended effect of active messages on future hardware). To overcome this, we can bundle together multiple messages that are destined for the same node. Analyzing how best to do this is an ongoing research project.

With the current version of the code, a 1 million atom argon cluster system (calculating only nonbonded forces) takes approximately 35 sec per timestep on all 512 nodes of the Intel Paragon, using a farfield update frequency of 5. A 10 million atom system, the largest run to date, takes approximately 330 sec per timestep on all 512 nodes.

The KSR version, in contrast, does substantially better in performance per CPU. On a 1 million atom virus dimer, including all valence forcefield terms, we obtain a time of 64.7 sec on 60 CPUs, or about four times the performance of the Paragon code.

The dynamic load balancing implemented in the KSR code can have a substantial effect on the timing. On a very small, 463 atom system running on 4 nodes at CMM level 2, we see an improvement in the farfield computation time of 13% (from 55 ms to 48 ms) due to the reduced load imbalance.

### 7.4. Scalability

Each of the steps comprising the CMM is linear and scalable, or nearly so.

There are seven steps in the CMM; these may be divided into two major parts. The five steps of the first part compute the farfield (the Taylor series expansions representing the field from atoms far away from each atom), while the two steps of the second part compute the nearfield (the explicit calculation of effects due to atoms near each atom).

The first step, generation of the leaf cell multipoles, is fully linear and runs in parallel since there are no data dependencies.

The second and third steps, computation of the cell centers and propagation of the multipoles upward through the tree, both require a traversal of the octree. Since the number of cells in the system is the sum of a geometric series with a logarithmic number of terms, it is essentially proportional to the number of atoms.

(1)

where \kappa is the number of atoms per cell at the finest level, a constant.

Each pass through the tree, whether upward or downward, involves a constant number of computations per cell and therefore is linear in the number of atoms in the system. The tree traversals cannot be made fully parallel, however, as there are increased data dependencies near the root of the tree. On the other hand, since the number of computations to be done near the root is relatively small, due to the high degree (8) of the octree, the tree traversal time is dominated by the computations near the leaves, which are highly parallel.

The fourth and fifth steps, the PNC computation and the propagation of the Taylor series coefficients downward through the tree are also linear and highly parallel as argued above.

The two steps of the nearfield computation (computing explicit interactions with atoms in the same cell and with atoms in neighboring cells) are perfectly linear and can also execute in a parallel fashion, limited only by the communications overhead of transmitting atoms from leaf cells to their neighbors.

Finally, the dynamics step has only one data dependency, a global sum to determine the overall kinetic energy, with the rest being perfectly parallel and linear.

The valence computations are essentially linear in the number of atoms, since each atom is only connected to a limited number of other atoms and can thus participate in only a limited number of valence interactions. In the message-passing code, we assume that no additional communications will be required to compute the valence interactions; this should also hold true for the KSR code, since the valence computations occur after the nearfield step in which neighboring atoms are accessed and brought into the processor's cache.

The total amount of computation that occurs is thus linear in the number of atoms. Nonlinearities in the scaling of computation with number of CPUs are the result of load balancing inefficiencies, which lead to waiting at global synchronization points.

The total amount of communication that is required is almost linear in the number of atoms, except for the tree effects described above. The fraction of this communication that occurs off-node, however, varies depending on the number of CPUs used, and will in general vary as the total surface area of the boundaries between cells assigned to each CPU, which is approximately . Further complicating the analysis, though, is the fact that much of this communication can itself occur in parallel. The amount of communication can also be decreased by taking into account the fact that an atom or PNC may need to interact with multiple cells on the same destination node. This avoidance of redundant transmissions has been implemented in the message-passing code for the PNC multipole communication step, but not yet for the nearfield atom communication step. On the KSR, this redundancy is automatically eliminated because the data is cached on the destination node.

The best case time is thus

while the worst case time is

where C is constant setup overhead, t_comp is the computation time per atom, t_comm is the communication time per cell, and k_loadbal represents the overhead due to imperfect load balancing.

The message-passing program was tested for performance on a series of multi-million atom argon cluster systems. Although these systems do not include Coulombic charges and their interactions, all Coulomb terms were still calculated (and correctly resulted in zero energy and zero force) and hence are included in the timing results. The calculations were run on the CSC Paragon XP/S using OSF/1 Release 1.0.4. Five cluster sizes were used: 1 million, 2 million, 5 million, 8 million, and 10 million atoms.

For a constant number of atoms, if we plot the logarithm of the time versus the logarithm of the number of CPUs, we will ideally get a line of slope -1 by Equation 2. As n_cpus gets large, the slope should level off and eventually begin increasing to a value of at most 1/3 as in Equation 3, assuming that imperfect load balancing does not depend much on the number of CPUs used.

Figure 7-1 shows such a graph of log(time) against log(CPUs) for the farfield Taylor series generation process. The number of CPUs along the X axis ranges from 64 to 512. Three lines are drawn to show the scaling for systems of different sizes ranging from 1 million to 5 million atoms. The 8 million and 10 million atom systems could only be run on all 512 CPUs. The thick line shows the slope that would be achieved for ideal (perfectly linear) scaling.

This portion of the calculation contains all of the tree manipulations. The effects of the data dependencies inherent in the tree, which cause imperfect parallelization, can be seen by the less negative slope of the lines, especially for the smallest, 1 million atom system. Larger systems, in which the amount of computation per node increases, show better scaling which is more nearly parallel to the ideal line. Note that we would only reach the regime of zero or positive slope in pathological cases with much too little computation for the amount of communication required (i.e. too few atoms spread across too many CPUs).

Note that the farfield computation is only performed at intervals, so its imperfect scaling has a relatively small effect on the overall time to solution.

Figure 7-2 shows the same type of graph, but for the nearfield and integration computations. This portion of the calculation contains no tree-derived data dependencies and its scaling curves are close to parallel with the ideal line.

For a constant number of CPUs, if we plot the time per atom versus the number of atoms, we should ideally get a constant. Deviation from the constant line should be most apparent at small numbers of atoms, since the deviation is expected to scale as, at worst, . These graphs are shown in Figure 7-3 for the farfield computation and Figure 7-4 for the nearfield and integration steps, in which the X axis is the number of atoms in the simulated system in millions and the Y axis is the time spent in the indicated portion of the code divided by the number of atoms.

We see that the lines for 128, 256, and 512 CPUs are close to flat, with the expected upturn for the farfield computation at small system sizes. To show that this is in fact due to the communications overhead, as described in the theoretical scaling formula (Equation 3), we can plot the time per atom against the number of atoms to the -1/3 power. This should give lines with a slope of zero in the best case, or (a positive constant) in the worst case. In Figure 7-5, the lines of zero or constant positive slope show that the imperfect scaling for the farfield computation is in fact due to the communications term in the theoretical scaling formula.

Figure 7-5. Scaling vs. for farfield computation.
On the KSR, we get close to linear scaling with both CPUs and atoms. Inverting Equation 2 above, we can compute an expected value of t_comp from the time per timestep, the number of CPUs, and the number of atoms. Doing so gives the approximately equal values in the following table for a viral system, including valence forcefield interactions:

Atoms (millions)        0.5    1.0    1.0    0.5    1.0
CPUs                     30     30     45     60     60
Time (CPU-ms/atom)      3.07   3.55   3.35   4.39   3.98

Table 7-2. Scalability for KSR code.

For comparison, fitting lines to the above graphs for the message-passing code gives t_comp values ranging from 5.6 to 18 CPU-ms/atom for the farfield step and 12 to 18 CPU-ms/atom for the nearfield and dynamics step.