During a Monte Carlo step, either the backbone or side-chain
conformation of one amino acid residue, selected at random, is
altered. If the backbone conformation is to be changed, a new
$(\phi,\psi)$ pair is selected for the residue. The
$(\phi,\psi)$ pair is chosen from a
grid of probabilities where the spacing between the gridpoints is
$S^\circ$. The grid, therefore, contains $N^2$
gridpoints, where $N = 360/S$.
The third backbone dihedral angle, $\omega$, is
fixed at $180^\circ$
during Monte Carlo simulations, except where it
occurs before proline residues. For prolines, there is a 7% chance of
flipping to the cis conformation ($\omega = 0^\circ$). However,
even for proline, $\omega$
is treated independently and not as a
third dimension in the probability grid.
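The backbone move described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the choice to return the centre of the selected bin, and the reading of the 7% figure as a per-draw probability of cis are all assumptions of this sketch.

```python
import random

def sample_phi_psi(grid, spacing, rng=random):
    """Draw a (phi, psi) pair from a normalized N x N probability grid.

    grid[i][j] is the probability of the bin starting at
    (-180 + i*spacing, -180 + j*spacing) degrees. Returning the bin
    centre is an assumption of this sketch.
    """
    n = len(grid)
    cells = [(i, j) for i in range(n) for j in range(n)]
    weights = [grid[i][j] for i, j in cells]
    # random.choices draws one cell with probability proportional to its weight
    i, j = rng.choices(cells, weights=weights, k=1)[0]
    phi = -180.0 + (i + 0.5) * spacing
    psi = -180.0 + (j + 0.5) * spacing
    return phi, psi

def sample_omega(precedes_proline, p_cis=0.07, rng=random):
    """Return omega in degrees: 180 (trans) by default, or 0 (cis) with
    probability p_cis when the peptide bond precedes a proline.
    omega is drawn independently of the (phi, psi) grid."""
    if precedes_proline and rng.random() < p_cis:
        return 0.0
    return 180.0
```

For a $30^\circ$ grid, `grid` would be a 12 x 12 list of lists whose entries sum to one.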
The probability grids were determined by partitioning every $(\phi,\psi)$ pair
in the proteins comprising the H64 dataset into bins of size
$S^\circ \times S^\circ$ and normalizing. We have determined separate
probability grids for each amino acid, but it is sufficient to use
individual grids for the three major residue types: glycine, which
has no sidechain, proline, whose sidechain forms a closed loop with
the backbone, and the
other 18 ``standard'' amino acids. The $(\phi,\psi)$
probabilities are
significantly different for these three residue types, as can be seen
in Figure
. The shape of the grid depends not only on
the data, but also on the grid spacing, $S$, as can be seen in
Figure
. A narrower spacing allows for much greater
conformational flexibility, which is especially important in
simulations of constrained systems. However, the total coverage of
conformational space is somewhat reduced for narrower grid spacings.
For instance, for standard residues, 110 of the 144 possible
$30^\circ$ gridpoints are populated (76.4%), while only 1114 out of
5184 gridpoints (21.5%) are populated on a $5^\circ$
grid. Of course,
the number of populated gridpoints, and their probabilities, depends
on the size and quality of the dataset. Therefore, in order to
evaluate the grids produced from the H64 dataset, we have also
constructed grids using the U121 dataset.
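The grid construction and the coverage figures quoted above can be illustrated with a short sketch. It assumes angles are given in degrees in the range $[-180, 180]$ and that the spacing divides 360 evenly; the function names are hypothetical and not taken from the original work.

```python
def build_grid(pairs, spacing):
    """Bin a list of (phi, psi) pairs into an N x N grid (N = 360/spacing)
    and normalize the counts so the grid sums to one."""
    if not pairs:
        raise ValueError("need at least one (phi, psi) pair")
    n = 360 // spacing
    counts = [[0] * n for _ in range(n)]
    for phi, psi in pairs:
        # shift to [0, 360] and wrap +180 onto the first bin
        i = int((phi + 180.0) // spacing) % n
        j = int((psi + 180.0) // spacing) % n
        counts[i][j] += 1
    total = len(pairs)
    return [[c / total for c in row] for row in counts]

def populated_fraction(grid):
    """Fraction of gridpoints with nonzero probability."""
    cells = [p for row in grid for p in row]
    return sum(1 for p in cells if p > 0) / len(cells)

# Coverage figures quoted in the text for standard residues
coverage_30 = round(100 * 110 / 144, 1)    # 30-degree grid, 12 x 12 gridpoints
coverage_5 = round(100 * 1114 / 5184, 1)   # 5-degree grid, 72 x 72 gridpoints
```

The two coverage values reproduce the percentages given in the text (76.4% and 21.5%).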
The number of each type of residue found in the two datasets is shown in
Table
. The U121 dataset contains nearly three times as
many residues as the H64 dataset. Although it is advantageous to have
a larger sample
size when doing statistical analyses, this advantage is mitigated for
the U121 dataset because of the inclusion of low-quality structures.
This problem is made clear in Table
, where the number of
non-zero gridpoints is listed for the three residue types at various
grid spacings. The inclusion of data from all structures in the U121
dataset greatly increases the number of gridpoints which are
populated. This is the case for all three residue types at all five
spacing levels, but is particularly notable at grid spacings of
$15^\circ$ and less. Clearly, far more areas of
conformational
space have at least one representative in the U121 dataset. However,
it is difficult to say whether this is due to the larger sample size
or reflects the fact that low-resolution structures are included in
the U121 data. Unusual conformations in these low-resolution
structures may be due to poor crystallographic data and might even be
a cause of bad fits to data (high R-factors). A more interesting
analysis is the number of high-probability gridpoints
($P > 1/N^2$), as shown in Table
. Because of the large
number of gridpoints with $P = 0$, the percentage having
$P > 1/N^2$ is substantially less than 50%. This number
is very consistent across different grid spacings and is far more
consistent between the datasets. This indicates that the U121 dataset
has a large number of very rare
conformations, and it should not
be detrimental to exclude them from the probability grids used for
our simulations. This is especially true for the standard residues
and for the larger grid spacings of glycine and proline. For the
ultrafine $5^\circ$
grids, there clearly is insufficient data for proline
and glycine conformations. The sample sizes for glycine and proline
are less than the number of $5^\circ$
gridpoints, so every nonzero
gridpoint automatically has $P$
greater than $1/N^2$. This
problem is particularly acute for the H64 dataset, where the
percentage of high-probability conformations drops off dramatically at
$5^\circ$.
This dataset is probably inadequate for glycine and proline conformation
sampling at a $5^\circ$
resolution.
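The small-sample argument can be checked with a short sketch. It assumes that the high-probability cutoff is the uniform value $1/N^2$, the probability each gridpoint would have if all were equally likely; that cutoff, and the function names, are assumptions of this sketch rather than something the text states explicitly.

```python
def normalize(counts):
    """Turn a grid of raw counts into probabilities."""
    total = sum(sum(row) for row in counts)
    return [[c / total for c in row] for row in counts]

def high_probability_fraction(grid, threshold=None):
    """Fraction of gridpoints whose probability exceeds the threshold.
    The default threshold, 1 / (number of gridpoints), is an assumption."""
    cells = [p for row in grid for p in row]
    if threshold is None:
        threshold = 1.0 / len(cells)
    return sum(1 for p in cells if p > threshold) / len(cells)

# With n samples on a grid of N^2 > n gridpoints, every occupied gridpoint
# has P >= 1/n > 1/N^2, so "high-probability" and "populated" coincide.
counts = [[0] * 72 for _ in range(72)]   # 5-degree grid: 72 x 72 = 5184 gridpoints
counts[0][0] = 1
counts[5][7] = 2
counts[10][10] = 1
sparse = normalize(counts)
```

In this toy case `high_probability_fraction(sparse)` equals the populated fraction, 3/5184, which is why the measure carries little information once the sample is smaller than the grid.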
Table
confirms what can be seen in
Figure
: the grids are substantially different for the
three residue types. Glycine is clearly more flexible, having a much
larger number of high-probability conformations. Proline, in
contrast, is far less flexible. There are far fewer high-probability
conformations for proline, as would be expected from geometrical
considerations. The closed ring formed by its backbone and sidechain
severely restricts the $\phi$ angle to values near $-60^\circ$. The
highest probability peak for each type of residue is shown in
Table
. For standard residues, the alpha-helical peak
predominates. For every spacing level, the alpha helical conformation
is the highest peak, even though the probability of picking the peak
gridpoint decreases as the total number of gridpoints increases. The
intra-strand hydrogen bonding of alpha-helices greatly favors
conformations near the helical $(\phi,\psi)$ values. Therefore, the peak is
very sharp, as becomes increasingly clear for the finer grids in
Figure
. In contrast, the beta sheet region of the
grid, centered about its own characteristic $(\phi,\psi)$ values, is much broader.
No individual gridpoint in the beta sheet region is as high
as the alpha helical peak, even though the beta sheet quadrant (I) has
nearly the same overall probability as the alpha helix quadrant (II)
(47.8% vs. 49.4%; see Table
). Proline grids have
two sharp peaks, as is seen for the $30^\circ$ grid in Figure
. The
two peaks are so similar that the identity of the highest peak depends
on both the grid spacing and the dataset. There is little probability
of proline conformations outside of the two peak regions; there is
almost no chance that the conformation is in quadrant III or IV.
The opposite is true for the third major residue type, glycine.
Glycine's great flexibility is clearly seen in Table
.
The four quadrants are almost equally populated, since there is no
sidechain to sterically hinder quadrant III and IV conformations.
Because of this flexibility, no single peak has a particularly high
probability (Table
).
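The peak and quadrant comparisons discussed above reduce to two simple operations on a probability grid, sketched below. The sketch labels quadrants by the signs of $\phi$ and $\psi$ rather than by the Roman numerals used in the text, since the numbering convention is not restated here; the function names and the use of bin centres are assumptions.

```python
def highest_peak(grid, spacing):
    """Return ((phi, psi), probability) for the most probable gridpoint,
    reporting the centre of the bin in degrees."""
    n = len(grid)
    i, j = max(((i, j) for i in range(n) for j in range(n)),
               key=lambda ij: grid[ij[0]][ij[1]])
    centre = (-180.0 + (i + 0.5) * spacing, -180.0 + (j + 0.5) * spacing)
    return centre, grid[i][j]

def quadrant_probabilities(grid):
    """Sum the grid over the four sign combinations of (phi, psi).
    Assumes an even number of bins per axis, so 0 degrees is a bin edge."""
    n = len(grid)
    half = n // 2
    sums = {}
    for i in range(n):
        for j in range(n):
            key = ("phi<0" if i < half else "phi>=0",
                   "psi<0" if j < half else "psi>=0")
            sums[key] = sums.get(key, 0.0) + grid[i][j]
    return sums
```

A sharp peak shows up as a large value from `highest_peak` relative to its quadrant total, while a broad region has a similar quadrant total spread over many gridpoints.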
We have also used the secondary structure designators in the protein
database (HELIX, SHEET, and TURN) to obtain separate probability grids
for alpha helix, beta sheet, and coil regions. We decided not to
create grids for beta turn residues because the four residues
involved in a turn usually have completely different $(\phi,\psi)$
conformations and it would be counterproductive to treat them
identically. Presumably, eight-dimensional probability grids
generated for sequences of four consecutive $(\phi,\psi)$
pairs would have peaks for
particular turn conformations as well, but the total number of turns
in our set of crystal structures is tiny compared to the immense
number of gridpoints on an eight-dimensional grid. Such grids would
have little advantage over a method which simply tries all known turn
configurations. We do have separate probability grids for coil
residues, however. We define coil residues as all those not
involved in any of the three major secondary structure types. Six
proteins in the H64 database had no HELIX, SHEET, or TURN designators,
and we excluded these from secondary structure analyses. We did not
want to assume a complete lack of secondary structural elements for
these proteins. The remaining 58 proteins with secondary structure
designators comprise the SS58 dataset, which we used to create the
probability grids shown in Figure
. Table
lists
the total number of samples of each residue type for each structural
class. While the coil population is large for all residue types, it
is particularly high for proline residues. The backbone nitrogen
of proline is bonded to the C$^\delta$ of the sidechain, so it is not
available for hydrogen bond formation. Prolines therefore cannot
act as donors in the backbone hydrogen bonds which stabilize
$\alpha$ helices,
$\beta$ sheets, and
$\beta$ turns. The coil grid in Figure
contains significant
probabilities for both $\alpha$ helix and
$\beta$ sheet conformations, but the
probabilities are much lower than those in the ``all-structures''
grid. Presumably, residues in the coil regions are not participating
in the extended hydrogen-bonding networks or involved in the
large-scale dipole-dipole interactions of
$\alpha$ helices and
$\beta$ sheets.
Consequently, the coil probability grids are more indicative of the
inherent conformational energies of individual residues and
are the grids which most closely resemble classic
Ramachandran plots[66] and
$(\phi,\psi)$ potential energy
maps[67]. These secondary structure-specific
grids are useful only when the secondary structure is known
beforehand. This is not the case for an ab initio prediction
of protein conformation, but it is the case for simulations used in
conjunction with C$^\alpha$
coordinates, homology modeling, or secondary structure prediction algorithms.
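The assignment of residues to structural classes described above can be sketched as follows. The input is deliberately simplified: each designator type is passed as a list of inclusive, 1-based (start, end) residue ranges rather than parsed from a database file. The function name, the input layout, and the precedence used when ranges overlap are assumptions of this sketch.

```python
def classify_residues(n_residues, helix, sheet, turn):
    """Label each residue as 'helix', 'sheet', 'turn', or 'coil'.

    Returns None when the protein has no designators at all, mirroring
    the exclusion of such proteins from the secondary structure analysis.
    Coil is everything not covered by any range.
    """
    if not (helix or sheet or turn):
        return None
    labels = ["coil"] * n_residues
    # Later entries overwrite earlier ones; this precedence is an assumption.
    for name, ranges in (("turn", turn), ("sheet", sheet), ("helix", helix)):
        for start, end in ranges:
            for k in range(start, end + 1):
                labels[k - 1] = name
    return labels

# Hypothetical 10-residue example: one helix (residues 2-4), one strand (7-8)
example = classify_residues(10, helix=[(2, 4)], sheet=[(7, 8)], turn=[])
```

The $(\phi,\psi)$ pairs of residues sharing a label would then be binned separately to give one grid per structural class; residues labeled 'turn' are simply left out, consistent with the decision not to build turn grids.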