In recent years, lattice-based methods have become increasingly popular tools for theoretical studies of protein folding[3,19-21]. In these calculations, a protein is represented by points on a 2-D or 3-D lattice. Typically, each amino acid occupies a single lattice site[79], but some methods use other models, such as one backbone and one sidechain site per residue[95]. Conformations of a protein are represented by chains traced through the lattice, with consecutive residues occupying adjacent sites. Adjacent sites can also be filled if the chain folds back upon itself. Because positions are limited to points on a lattice, energy calculations are extremely fast. Valence terms such as bond stretches can be eliminated entirely, since only a few bond geometries are possible. In addition, nonbonded forces can be calculated rapidly because distances between lattice sites are known in advance. Lattice simulations therefore speed the exploration of a protein's conformational space in two ways: the size of conformational space is reduced by allowing only lattice conformations, and the cost of evaluating each conformation is greatly reduced through the use of simplified energy terms.
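The lattice representation described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a simple cubic lattice with one site per residue, and the contact energy `epsilon` is a hypothetical placeholder rather than a parameter of any method cited here.

```python
from itertools import combinations

# Unit steps to the six nearest neighbours on a simple cubic lattice
# (assumed lattice type, chosen only for illustration).
NEIGHBOR_STEPS = {
    (1, 0, 0), (-1, 0, 0),
    (0, 1, 0), (0, -1, 0),
    (0, 0, 1), (0, 0, -1),
}


def is_adjacent(a, b):
    """Return True if lattice sites a and b are nearest neighbours."""
    return tuple(x - y for x, y in zip(a, b)) in NEIGHBOR_STEPS


def is_valid_chain(sites):
    """A chain is valid if no site is used twice and consecutive
    residues occupy adjacent sites."""
    if len(set(sites)) != len(sites):
        return False
    return all(is_adjacent(a, b) for a, b in zip(sites, sites[1:]))


def contact_energy(sites, epsilon=-1.0):
    """Sum a hypothetical contact energy over all pairs of residues that
    sit on adjacent sites but are not consecutive in the chain."""
    energy = 0.0
    for (i, a), (j, b) in combinations(enumerate(sites), 2):
        if j - i > 1 and is_adjacent(a, b):
            energy += epsilon
    return energy


# A four-residue chain folded into a square in the z = 0 plane.
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
# The same chain fully extended along x.
extended = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
```

Because adjacency is a set lookup on integer offsets, no distances are computed at run time, which is the source of the speed advantage noted above.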
Despite the simplifications of the lattice methodology, there is
still a huge number of possible conformations available to even a
small protein. And while energy functions may give favorable values
to the ``correct'' structure (the lattice conformation most closely
resembling the native structure)[79], they are rarely
sufficiently accurate to predict it outright. To evaluate
lattice conformations more fully, and to enable construction of
all-atom protein conformations from lattice models, we have developed
a ``C$_\alpha$ Forcefield'' (C$_\alpha$FF) for use in molecular
mechanics simulations of C$_\alpha$ models of proteins. This
forcefield is used to optimize lattice conformations, giving them
geometries more like those of true proteins. These optimized
C$_\alpha$ conformations can then be used as templates for the PGMC
C$_\alpha$ Builder. This process, termed the
``Hierarchical Protein Folding Strategy'' (HPFS), is shown in
the accompanying figure. The method has a hierarchy of refinement
levels: lattice conformations, C$_\alpha$ conformations optimized with
the C$_\alpha$FF, and all-atom conformations built from the optimized
C$_\alpha$ templates.
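The simple C$_\alpha$ Forcefield which we have developed for lattice
structure optimization has valence terms only. Nonbonded interactions,
such as van der Waals and electrostatic terms, are not included in the
forcefield. Future enhancements of the C$_\alpha$FF will include such
terms and will be amino acid-specific. The current implementation,
however, treats all amino acid types equally, and has three terms:
\begin{equation}
E_b = \sum \frac{1}{2} K_b \left( R - R_0 \right)^2 ,
\end{equation}
\begin{equation}
E_\theta = \sum \frac{1}{2} K_{\theta,\alpha\beta} \left( \theta - \theta_{0,\alpha\beta} \right)^2 ,
\end{equation}
and
\begin{equation}
E_\tau = \sum \frac{1}{2} K_{\tau,\alpha\beta} \left( \tau - \tau_{0,\alpha\beta} \right)^2 .
\end{equation}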
The bond energy, $E_b$, is summed over all C$_\alpha$-C$_\alpha$
distances $R$, while angle and torsion terms are summed over
all virtual angles, $\theta$, and virtual dihedrals, $\tau$, as
defined in the accompanying figure. The subscripts $\alpha\beta$
denote that different angle and torsion force constants
($K_\theta$, $K_\tau$) and equilibrium geometries
($\theta_0$, $\tau_0$) are used for $\alpha$ helix and $\beta$ sheet
conformations. These bond and angle terms are
commonly found in atomic forcefields, but the torsion term is
unlike a typical torsion term, which uses an expansion of cosine
terms (see the equation for the standard torsion term). The present
form was used because the virtual dihedrals do not have probability
minima or maxima at $0^\circ$ or $180^\circ$, so no cosine expansion
could reproduce the known distribution. Unfortunately, this form
causes problems in calculating atomic forces for certain dihedral
values, so alternate functional forms are being investigated.
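A valence-only energy of this kind might be evaluated as in the sketch below. It rests on several assumptions that should be read as such: all three terms are taken to be harmonic, a single parameter set is used instead of separate helix and sheet sets, the dihedral deviation is wrapped into [-180, 180) degrees, and every value in `HYPOTHETICAL_PARAMS` is a placeholder chosen for the toy geometry, not a fitted parameter.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def norm(a):
    return math.sqrt(dot(a, a))


def virtual_angle(p0, p1, p2):
    """Angle in degrees at p1 formed by three consecutive C-alpha atoms."""
    u, v = sub(p0, p1), sub(p2, p1)
    cosine = dot(u, v) / (norm(u) * norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


def virtual_dihedral(p0, p1, p2, p3):
    """Dihedral in degrees defined by four consecutive C-alpha atoms."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    m1 = cross(n1, tuple(x / norm(b2) for x in b2))
    return math.degrees(math.atan2(dot(m1, n2), dot(n1, n2)))


def valence_energy(coords, params):
    """Sum of harmonic bond, angle, and torsion terms over a C-alpha chain."""
    e_bond = sum(
        0.5 * params["k_bond"] * (norm(sub(b, a)) - params["r0"]) ** 2
        for a, b in zip(coords, coords[1:])
    )
    e_angle = sum(
        0.5 * params["k_angle"]
        * (virtual_angle(*coords[i:i + 3]) - params["theta0"]) ** 2
        for i in range(len(coords) - 2)
    )
    e_torsion = 0.0
    for i in range(len(coords) - 3):
        delta = virtual_dihedral(*coords[i:i + 4]) - params["tau0"]
        delta = (delta + 180.0) % 360.0 - 180.0  # wrap into [-180, 180)
        e_torsion += 0.5 * params["k_torsion"] * delta ** 2
    return e_bond + e_angle + e_torsion


# Placeholder parameters matched to the toy chain below (not fitted values).
HYPOTHETICAL_PARAMS = {
    "k_bond": 2.0, "r0": 1.0,
    "k_angle": 1.0, "theta0": 90.0,
    "k_torsion": 1.0, "tau0": 0.0,
}

# Four points with unit bonds, 90 degree angles, and a planar cis dihedral.
toy_chain = [(0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0)]
```

The wrapping step makes the energy continuous in the deviation itself, but its derivative still jumps where the deviation reaches 180 degrees, which is one way a harmonic torsion can complicate force evaluation.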
Parameters for the C$_\alpha$ Forcefield have been determined from
analyses of the C$_\alpha$ coordinates in protein structures from the
Brookhaven PDB. A subset of 64 protein structures was used. This
``H64'' dataset, which was also used to develop the grids employed
elsewhere in this work, is described in detail elsewhere in this work.
The accompanying figure shows the distribution of
C$_\alpha$($i$)-C$_\alpha$($i+1$) distances in the H64 dataset,
using a 0.01 Å interval to determine probabilities.
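From this distribution, an average, $\bar{R}$, and standard deviation,
$\sigma_R$, can be calculated. The average is used directly as the
equilibrium distance $R_0$ in the bond energy term, while the force
constant is derived from
\begin{equation}
K_b = \frac{k_B T}{\sigma_R^2} ,
\end{equation}
where $k_B$ is the Boltzmann constant and $T$ is the temperature.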
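The step from an observed distribution to harmonic parameters can be illustrated as follows. The sketch assumes a harmonic energy with a factor of one half, so that the force constant is the thermal energy divided by the variance; the temperature of 300 K, the use of the population standard deviation, and the sample distances are all assumptions made for the example, not values from the H64 analysis.

```python
import statistics

# Boltzmann constant expressed per mole, in kcal/(mol K).
K_BOLTZMANN = 0.0019872


def harmonic_parameters(samples, temperature=300.0):
    """Return (equilibrium value, force constant) for a harmonic term
    whose Boltzmann distribution matches the sample mean and variance.

    The temperature default is an assumed value for illustration.
    """
    mean = statistics.mean(samples)
    sigma = statistics.pstdev(samples)
    force_constant = K_BOLTZMANN * temperature / sigma ** 2
    return mean, force_constant


# Made-up distances in angstroms, used only to exercise the function.
example_distances = [3.7, 3.8, 3.9]
r0, k_bond = harmonic_parameters(example_distances)
```

A narrower distribution gives a larger force constant, since the variance appears in the denominator.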
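Using these parameters in the bond energy term gives a probability
distribution very similar to that derived from the crystal structures.
The probability distribution is determined from:
\begin{equation}
P(R) = \frac{e^{-E_b(R)/k_B T}}{\int e^{-E_b(R')/k_B T} \, dR'} ,
\end{equation}
where $E_b(R)$ is the bond energy of a single C$_\alpha$-C$_\alpha$
distance $R$. Replacing the integral by a sum over 0.01 Å intervals
gives the probability distribution shown in the accompanying figure.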
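Replacing the integral by a sum can be sketched as below, assuming a harmonic bond energy. The grid limits, the force constant, and the default thermal energy (roughly 0.596 kcal/mol, corresponding to an assumed 300 K) are illustrative choices, not values taken from the parameter table.

```python
import math


def boltzmann_distribution(r0, k_bond, r_min, r_max, dr=0.01, kt=0.596):
    """Discretized Boltzmann probabilities for a harmonic bond energy.

    Returns the grid of distances and the probability assigned to each
    grid point; the probabilities sum to one by construction.
    """
    n_points = int(round((r_max - r_min) / dr)) + 1
    grid = [r_min + i * dr for i in range(n_points)]
    weights = [
        math.exp(-0.5 * k_bond * (r - r0) ** 2 / kt) for r in grid
    ]
    total = sum(weights)
    return grid, [w / total for w in weights]


# Hypothetical parameters: equilibrium 3.8 angstroms, force constant 100.
grid, probabilities = boltzmann_distribution(3.8, 100.0, 3.5, 4.1)
peak_position = grid[probabilities.index(max(probabilities))]
```

The resulting list can be compared bin by bin with an observed histogram that uses the same 0.01 Å interval.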
Similar analyses can be made for the virtual angles ($\theta$) and
dihedrals ($\tau$). However, it should first be noted that there
are strong conformational propensities in protein backbones which
lead to corresponding $\theta$-$\tau$ correlations. This is clearly
seen in the accompanying figure, where 2000 randomly selected
($\theta$, $\tau$) values from the H64 dataset are plotted. There are
two high density regions.
This can also be seen by binning the data. A further figure shows
probability grids derived by determining the fraction of all points
falling in each ($\theta$, $\tau$) bin, for $\theta$ and $\tau$
intervals of 15$^\circ$. There are two distinct
peaks, which correspond to $\alpha$ helix and $\beta$ sheet regions,
as is made evident by the separate probability grids for HELIX and
SHEET residues.
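The binning step can be sketched as follows. The 15 degree bin width comes from the text; the function name, the choice to key each bin by its lower edges, and the sample pairs are assumptions made for illustration.

```python
import math
from collections import Counter


def probability_grid(pairs, bin_width=15.0):
    """Fraction of (angle, dihedral) points falling in each square bin.

    Keys of the returned dict are the lower edges of each bin, in degrees.
    """
    counts = Counter(
        (math.floor(angle / bin_width) * bin_width,
         math.floor(dihedral / bin_width) * bin_width)
        for angle, dihedral in pairs
    )
    total = len(pairs)
    return {bin_edges: n / total for bin_edges, n in counts.items()}


# Made-up (virtual angle, virtual dihedral) pairs in degrees.
sample_pairs = [(91.0, 50.0), (95.0, 55.0), (92.0, 47.0), (120.0, -170.0)]
grid = probability_grid(sample_pairs)
```

Here three of the four sample points share the bin with lower edges (90, 45), so that bin receives a probability of 0.75 and the remaining bin 0.25.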
The high probability regions for the two major secondary structure
types are listed in the accompanying table. These regions account for
39.7% ($\alpha$ helix) and 34.7% ($\beta$ sheet) of all
($\theta$, $\tau$) points. The ($\theta$, $\tau$) pairs which fell
within the $\alpha$ helix or $\beta$ sheet regions were used to
calculate average values and standard deviations of $\theta$ and
$\tau$ for each of these regions. These, in turn, were used as
equilibrium geometries and to calculate force constants, as was done
for bond lengths. All such parameters for the C$_\alpha$FF are listed
in the parameter table. Note that the $\alpha$ helix force constants
are significantly higher than the $\beta$ sheet ones, reflecting the
much sharper peak in the $\alpha$ helix region of the
$\theta$-$\tau$ probability distribution.
The forcefield described above was used to optimize lattice
conformations for several proteins. These lattice conformations were
generated by finding the conformations on a face-centered cubic (fcc)
lattice which best matched the crystal structures. These
conformations were then optimized by conjugate-gradient
minimization using the C$_\alpha$FF. As shown in the accompanying
table, C$_\alpha$ coordinates after minimization with the
C$_\alpha$FF are usually much better than the lattice coordinates.
The accompanying figure displays this improvement more dramatically,
by showing the lattice and minimized structures of crambin. Clearly,
the lattice constraint imposes unnatural geometries on the C$_\alpha$
configuration, a problem remedied by the C$_\alpha$FF.
The utility of the C$_\alpha$FF is further displayed by the results in
the accompanying table. In these simulations, several C$_\alpha$
coordinate sets for crambin were used as templates for the PGMC
C$_\alpha$ Builder. The results shown in the table are those obtained
after the final all-atom conformation was energy-minimized using
DREIDING. Naturally, the C$_\alpha$ coordinates from the crystal
structure itself form the best template for the C$_\alpha$ Builder.
Minimizing the crystal structure C$_\alpha$ atoms with the
C$_\alpha$FF causes them to diverge from their
true coordinates, but a good model, with a backbone RMS deviation of
only 1.0 Å, can still be built. Use of the lattice conformation,
however, produced poor results, with a backbone RMS deviation of
nearly 2.0 Å. The results are significantly improved through the
use of the C$_\alpha$FF, which reduces the error per atom by almost
0.5 Å.
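The RMS deviations quoted above compare a built model with the crystal structure atom by atom. A minimal version of that measure is sketched below, under the assumption that the two coordinate sets list the same atoms in the same order and have already been superimposed; the function performs no fitting itself, and the coordinates are made up.

```python
import math


def rms_deviation(coords_a, coords_b):
    """Root-mean-square distance between matched atoms of two structures.

    Assumes both structures are already superimposed and list the same
    atoms in the same order.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate sets must have the same length")
    squared = sum(
        (x - y) ** 2
        for a, b in zip(coords_a, coords_b)
        for x, y in zip(a, b)
    )
    return math.sqrt(squared / len(coords_a))


# Two made-up two-atom structures displaced by 1 angstrom along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
deviation = rms_deviation(reference, model)
```

In practice an optimal superposition would normally be applied first, since the raw measure is sensitive to any rigid-body offset between the two structures.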
The C$_\alpha$FF is, therefore, able to assist significantly in the
building of all-atom conformations of proteins from lattice models of
their C$_\alpha$ coordinates. Other uses may include the comparison of
different lattice models by energy evaluation and/or minimization.
This may ease the difficult task of determining which lattice
conformations are native-like. In addition, future enhancements of
the C$_\alpha$ Forcefield will include nonbonded forces as well as
residue masses, thereby allowing for the possibility of extremely fast
molecular dynamics simulations of a C$_\alpha$ protein model.