In recent years, lattice-based methods have become increasingly popular tools for theoretical studies of protein folding[3,19-21]. In these calculations, a protein is represented by points on a 2-D or 3-D lattice. Typically, each amino acid occupies a single lattice site[79], but some methods use other models, such as one backbone site and one sidechain site per residue[95]. Conformations of a protein are represented by chains traced through the lattice, with consecutive residues occupying adjacent sites. Residues that are not consecutive in sequence can also occupy adjacent sites where the chain folds back upon itself. Because positions are limited to points on a lattice, energy calculations are extremely fast. Valence terms such as bond stretches can be eliminated entirely, since only a few bonded geometries are possible. In addition, nonbonded interactions can be calculated rapidly because distances between lattice sites are known in advance. Lattice simulations therefore speed the exploration of a protein's conformational space in two ways: the size of the conformational space is reduced by allowing only lattice conformations, and the cost of evaluating each conformation is greatly reduced through the use of simplified energy terms.
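The chain representation described above can be illustrated with a minimal sketch. It assumes a simple cubic lattice with integer coordinates and a single uniform contact energy `epsilon`; both are illustrative choices made here, not the models used in the cited studies, and all function names are hypothetical.

```python
from itertools import combinations

def is_adjacent(a, b):
    """True if two integer lattice sites are nearest neighbours (simple cubic)."""
    return sum(abs(x - y) for x, y in zip(a, b)) == 1

def is_valid_chain(sites):
    """Consecutive residues must occupy adjacent sites; no site is used twice."""
    if len(set(sites)) != len(sites):
        return False
    return all(is_adjacent(a, b) for a, b in zip(sites, sites[1:]))

def count_contacts(sites):
    """Count pairs of non-consecutive residues that sit on adjacent sites."""
    return sum(
        1
        for (i, a), (j, b) in combinations(enumerate(sites), 2)
        if j - i > 1 and is_adjacent(a, b)
    )

def contact_energy(sites, epsilon=-1.0):
    """Assumed energy model: every contact contributes the same epsilon."""
    return epsilon * count_contacts(sites)

# Two four-residue chains in the z = 0 plane (made-up coordinates).
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]    # folds back on itself
straight = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]  # fully extended
```

Because adjacency is an integer test, no distances or bond terms need to be evaluated, which is the source of the speed advantage: here `square` has one contact (between the first and last residues) and `straight` has none.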

Despite the simplifications of the lattice methodology, there is still a huge number of possible conformations available to even a small protein. Moreover, while lattice energy functions may assign favorable values to the ``correct'' structure (the lattice conformation most closely resembling the native structure)[79], they are rarely accurate enough to predict it outright. In order to evaluate lattice conformations more fully, and to enable construction of all-atom protein conformations from lattice models, we have developed a ``Cα Forcefield'' (CFF) for use in molecular mechanics simulations of Cα models of proteins. This forcefield is used to optimize lattice conformations, giving them geometries more like those of real proteins. These optimized Cα conformations can then be used as templates for the PGMC Cα Builder. This process, termed the ``Hierarchical Protein Folding Strategy'' (HPFS), is shown in Figure . The method has a hierarchy of refinement levels:

- The lattice Cα-only model.
- The Cα model optimized using the CFF.
- Backbone atoms added by the Cα Builder (Phase 1).
- Sidechain atoms added by the Cα Builder (Phase 2).
- All-atom conformation optimized by full Cartesian energy minimization.

The simple Cα Forcefield that we have developed for lattice structure optimization has valence terms only. Nonbonded interactions, such as van der Waals and electrostatic terms, are not included. Future enhancements of the CFF will include such terms and will be amino acid-specific. The current implementation, however, treats all amino acid types identically, and has the three terms:

and

The bond energy is summed over all Cα-Cα distances, while the angle and torsion terms are summed over all virtual angles and virtual dihedrals, as defined in Figure . The subscripts denote that different angle and torsion force constants and equilibrium geometries are used for helix and sheet conformations. The bond and angle terms have the form commonly found in atomic forcefields, but the torsion term differs from a typical torsion term, which uses an expansion of cosine terms (see Equation ()). The present form was used because the virtual dihedrals do not have probability minima or maxima at , so no cosine expansion could reproduce the known distribution. Unfortunately, problems arise in calculating atomic forces when (), so alternate functional forms are being investigated.
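For concreteness, a three-term valence forcefield of this kind can be sketched as follows. The harmonic form of every term and all symbol names (force constants $K$, equilibrium values with subscript $0$, and a subscript $s$ standing for helix or sheet) are assumptions made here for illustration; they are not taken from the original equations.

```latex
E_{\mathrm{CFF}} = E_{\mathrm{bond}} + E_{\mathrm{angle}} + E_{\mathrm{torsion}}

E_{\mathrm{bond}} = \sum_{i} \tfrac{1}{2} K_{r} \left( r_{i} - r_{0} \right)^{2}

E_{\mathrm{angle}} = \sum_{i} \tfrac{1}{2} K_{\theta,s} \left( \theta_{i} - \theta_{0,s} \right)^{2}
\qquad
E_{\mathrm{torsion}} = \sum_{i} \tfrac{1}{2} K_{\phi,s} \left( \phi_{i} - \phi_{0,s} \right)^{2}
```

Under this assumption the torsion minimum can sit at any dihedral value. By contrast, a series of $\cos(n\phi)$ terms without phase shifts is symmetric about $0^{\circ}$ and $180^{\circ}$ and so always has an extremum at those values.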

Parameters for the Cα Forcefield have been determined from analyses of the Cα coordinates in the protein structures of the Brookhaven PDB. A subset of 64 of the protein structures was used. This ``H64'' dataset was also used for the development of and grids and is described in detail in Section . Figure shows the distribution of Cα(i)-Cα(i+1) distances in the H64 dataset, using a 0.01 Å interval to determine probabilities. From this distribution, an average and a standard deviation can be calculated. The average is used directly in Equation () as the equilibrium bond length, while the force constant is derived from

where k is the Boltzmann constant and T is the temperature. Using these parameters in Equation () gives a probability distribution very similar to that derived from the crystal structures. The probability distribution is determined from:

Replacing the integral by a sum over 0.01 Å intervals gives the probability distribution in Figure .
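The parameterization just described can be sketched in a few lines. The sketch assumes a harmonic bond term, E = ½K(r − r0)², for which the Boltzmann distribution is a Gaussian with variance kT/K, so that K = kT/σ². That functional form, the temperature of 300 K, the function names, and the numerical inputs (r0 = 3.80 Å, σ = 0.04 Å) are all illustrative assumptions, not values from the H64 analysis.

```python
import math

K_B = 0.0019872  # Boltzmann constant in kcal/(mol K)

def mean_and_std(values, weights=None):
    """Weighted mean and standard deviation (uniform weights by default)."""
    if weights is None:
        weights = [1.0] * len(values)
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, math.sqrt(var)

def harmonic_force_constant(sigma, temperature=300.0):
    """K = kT / sigma^2, valid for the assumed form E = 0.5 * K * (r - r0)^2."""
    return K_B * temperature / sigma ** 2

def boltzmann_histogram(r0, k, temperature=300.0,
                        r_min=3.0, r_max=4.6, width=0.01):
    """Boltzmann probabilities with the integral replaced by a sum over bins."""
    n_bins = round((r_max - r_min) / width)
    centers = [r_min + (i + 0.5) * width for i in range(n_bins)]
    weights = [math.exp(-0.5 * k * (r - r0) ** 2 / (K_B * temperature))
               for r in centers]
    total = sum(weights)
    return centers, [w / total for w in weights]

# Placeholder inputs, not H64 results.
r0, sigma = 3.80, 0.04
k_bond = harmonic_force_constant(sigma)
centers, probs = boltzmann_histogram(r0, k_bond)
mean, std = mean_and_std(centers, probs)  # should recover r0 and sigma
```

Recovering the input mean and standard deviation from the discretized distribution is the consistency check implied by the comparison between the two probability distributions.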

Similar analyses can be made for the virtual angles and dihedrals. However, it should first be noted that there are strong conformational propensities in protein backbones, which lead to corresponding correlations between the two quantities. This is clearly seen in Figure , where 2000 randomly selected values from the H64 dataset are plotted. There are two high-density regions. These can also be seen by binning the data. Figure shows probability grids obtained by determining the fraction of all points that fall in each grid cell, using intervals of 15° for both the virtual angle and the virtual dihedral. There are two distinct peaks, which correspond to helix and sheet regions, as is made evident by the probability grids for HELIX and SHEET residues in Figure .
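The quantities being binned can be computed directly from consecutive Cα positions. The following is a minimal sketch: the function names are hypothetical, the dihedral sign follows the usual IUPAC convention (an assumption, since the convention used for the figures is not stated), and dihedrals are mapped onto the range -180° to 180° before binning.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return math.sqrt(dot(a, a))

def virtual_angle(p0, p1, p2):
    """Angle in degrees at p1 formed by three consecutive Cα positions."""
    u, v = sub(p0, p1), sub(p2, p1)
    cos_t = dot(u, v) / (norm(u) * norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def virtual_dihedral(p0, p1, p2, p3):
    """Signed dihedral in degrees defined by four consecutive Cα positions."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    b2_hat = tuple(x / norm(b2) for x in b2)
    return math.degrees(math.atan2(dot(b2_hat, cross(n1, n2)), dot(n1, n2)))

def grid_probabilities(pairs, width=15.0):
    """Fraction of (angle, dihedral) pairs in each width x width grid cell."""
    n_phi = round(360.0 / width)
    counts = {}
    for theta, phi in pairs:
        # A dihedral of exactly 180 wraps onto the bin that starts at -180.
        cell = (int(theta // width), int((phi + 180.0) // width) % n_phi)
        counts[cell] = counts.get(cell, 0) + 1
    return {cell: n / len(pairs) for cell, n in counts.items()}
```

Each grid cell is identified by a pair of integer indices, so the helix and sheet peaks would appear as the cells with the largest fractions.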

The high-probability regions for the two major secondary structure types are listed in Table . These regions account for 39.7% (helix) and 34.7% (sheet) of all points. The (angle, dihedral) pairs that fell within the helix or sheet regions were used to calculate average values and standard deviations of the virtual angle and virtual dihedral for each of these regions. These, in turn, were used as equilibrium geometries and to calculate force constants, as was done for bond lengths (Equation ()). All such parameters for the CFF are listed in Table . Note that the helix force constants are significantly higher than the sheet ones, reflecting the much sharper peak in the helix region of the probability distribution.

The forcefield described above was used to optimize lattice conformations for several proteins. These lattice conformations were generated by finding the conformations on a face-centered cubic (fcc) lattice that best matched the crystal structures. The conformations were then optimized by conjugate-gradient minimization using the CFF. As shown in Table , Cα coordinates after minimization by the CFF are usually in much better agreement with the crystal structures than the lattice conformations are. Figure displays this improvement more dramatically, by showing the lattice and minimized structures of crambin. Clearly, the lattice constraint imposes unnatural geometries on the Cα configuration, a problem remedied by the CFF.
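The geometric restriction imposed by an fcc lattice can be made concrete with a short sketch. It enumerates the 12 nearest-neighbour vectors of the lattice and lists the virtual angles that three consecutive sites can form. The nearest-neighbour spacing of 3.8 Å is an assumption chosen here for illustration, and the function names are hypothetical.

```python
import math
from itertools import product

CA_SPACING = 3.8  # assumed nearest-neighbour distance in Å (illustrative)

def fcc_neighbor_vectors(spacing=CA_SPACING):
    """The 12 nearest-neighbour vectors of an fcc lattice: (±1, ±1, 0) and
    its permutations, scaled so that each vector has length `spacing`."""
    scale = spacing / math.sqrt(2.0)
    return [
        (x * scale, y * scale, z * scale)
        for x, y, z in product((-1, 0, 1), repeat=3)
        if abs(x) + abs(y) + abs(z) == 2
    ]

def allowed_virtual_angles(vectors):
    """Distinct angles (degrees) at the middle site of three consecutive sites."""
    angles = set()
    for b1, b2 in product(vectors, repeat=2):
        length = math.sqrt(sum(p * p for p in b1) * sum(q * q for q in b2))
        # The angle at the middle site lies between -b1 and b2.
        cos_t = -sum(p * q for p, q in zip(b1, b2)) / length
        angle = round(math.degrees(math.acos(max(-1.0, min(1.0, cos_t)))))
        if angle != 0:  # 0 would put residue i+2 back on top of residue i
            angles.add(angle)
    return sorted(angles)

vectors = fcc_neighbor_vectors()
print(len(vectors))                     # 12
print(allowed_virtual_angles(vectors))  # [60, 90, 120, 180]
```

Only four discrete virtual angles are available on this lattice, whereas an off-lattice chain can take on a continuous range of values. Relaxing this restriction is what the minimization with the CFF accomplishes.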

The utility of the CFF is further demonstrated by the results in Table . In these simulations, several Cα coordinate sets for crambin were used as templates for the PGMC Cα Builder. The results shown in the table were obtained after the final all-atom conformation had been energy-minimized using the DREIDING forcefield.

Naturally, the Cα coordinates from the crystal structure itself form the best template for the Cα Builder. Minimizing the crystal structure Cα atoms with the CFF causes them to diverge from their true coordinates, but a good model, with a backbone RMS deviation of only 1.0 Å, can still be built. Use of the unoptimized lattice conformation, however, produced poor results, with a backbone RMS deviation of nearly 2.0 Å. Optimizing the lattice conformation with the CFF significantly improves the results, reducing the error per atom by almost 0.5 Å.
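For reference, an RMS deviation of the kind quoted above can be computed as in the following minimal sketch. It assumes that the two coordinate sets list the same atoms in the same order and have already been superimposed; how the structures were superimposed for the reported values is not stated here, and the function name is hypothetical.

```python
import math

def rms_deviation(coords_a, coords_b):
    """RMS deviation between two equally ordered, pre-superimposed atom sets."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    total = sum(
        (x - y) ** 2
        for a, b in zip(coords_a, coords_b)
        for x, y in zip(a, b)
    )
    return math.sqrt(total / len(coords_a))

# Made-up coordinates: every atom is displaced by 1 Å along z.
model = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0)]
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
```

With these made-up coordinates the deviation is exactly 1 Å, since every atom is displaced by the same amount.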

The CFF is therefore able to assist significantly in the building of all-atom conformations of proteins from lattice models of their Cα coordinates. Other uses may include the assessment of different lattice models by energy evaluation and/or minimization, which may ease the difficult task of determining which lattice conformations are native-like. In addition, future enhancements of the Cα Forcefield will include nonbonded forces as well as residue masses, thereby allowing for the possibility of extremely fast molecular dynamics simulations of a Cα protein model.

Sat Jun 18 14:06:11 PDT 1994