During a Monte Carlo step, either the backbone or the side-chain conformation of one amino acid residue, selected at random, is altered. If the backbone conformation is to be changed, a new $(\phi,\psi)$ pair is selected for the residue. The pair is chosen from a grid of probabilities in which the spacing between gridpoints is $S$. The grid therefore contains $N^2$ gridpoints, where $N = 360^\circ/S$. The third backbone dihedral angle, $\omega$, is fixed at $180^\circ$ (trans) during Monte Carlo simulations, except where it occurs before proline residues. For prolines, there is a 7% chance of flipping to the cis conformation ($\omega = 0^\circ$). However, even for proline, $\omega$ is treated independently and not as a third dimension of the probability grid.
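The backbone move described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this work: the function names, the dictionary representation of the grid, the convention that gridpoint $(i,j)$ sits at $(-180^\circ + iS, -180^\circ + jS)$, and the numerical values $180^\circ$ (trans) and $0^\circ$ (cis) for $\omega$ are all choices made for the example.

```python
import random

def sample_phi_psi(grid, spacing, rng=random):
    """Draw one (phi, psi) gridpoint with probability given by the grid.

    grid    : dict mapping (i, j) gridpoint indices to probabilities
              (unpopulated gridpoints are simply absent) -- hypothetical layout
    spacing : grid spacing S in degrees
    """
    gridpoints = list(grid.keys())
    weights = [grid[g] for g in gridpoints]
    i, j = rng.choices(gridpoints, weights=weights, k=1)[0]
    # Assumed convention: index 0 corresponds to -180 degrees.
    return -180.0 + i * spacing, -180.0 + j * spacing

def sample_omega(next_is_proline, p_cis=0.07, rng=random):
    """Return omega: trans everywhere, except a p_cis chance of cis before proline.

    omega is drawn independently of (phi, psi); it is not a third grid axis.
    """
    if next_is_proline and rng.random() < p_cis:
        return 0.0    # cis (assumed value)
    return 180.0      # trans (assumed value)
```

Keeping `sample_omega` separate from `sample_phi_psi` mirrors the statement that $\omega$ is treated independently of the two-dimensional probability grid.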
The probability grids were determined by partitioning every $(\phi,\psi)$ pair in the proteins comprising the H64 dataset into bins of size $S \times S$ and normalizing. We have determined separate probability grids for each amino acid, but it is sufficient to use individual grids for the three major residue types: glycine, which has no sidechain; proline, whose sidechain forms a closed loop with the backbone; and the other 18 ``standard'' amino acids. The probabilities are significantly different for these three residue types, as can be seen in Figure . The shape of the grid depends not only on the data but also on the grid spacing, $S$, as can be seen in Figure . A narrower spacing allows for much greater conformational flexibility, which is especially important in simulations of constrained systems. However, the fraction of gridpoints that are populated falls as the spacing narrows. For instance, for standard residues, 110 of the 144 possible $30^\circ$ gridpoints are populated (76.4%), while only 1114 of the 5184 gridpoints (21.5%) are populated on a $5^\circ$ grid. Of course, the number of populated gridpoints, and their probabilities, depend on the size and quality of the dataset. Therefore, in order to evaluate the grids produced from the H64 dataset, we have also constructed grids using the U121 dataset.
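The binning and normalization step, and the coverage figures quoted above, can be illustrated with the following minimal sketch. The function names and the choice to place index 0 at $-180^\circ$ are assumptions made for the example; the input is any list of $(\phi,\psi)$ pairs in degrees.

```python
from collections import Counter

def build_grid(pairs, spacing):
    """Bin (phi, psi) pairs into S x S bins and normalize to probabilities.

    Returns a dict mapping (i, j) bin indices to probabilities; bins that
    received no samples are absent from the dict.
    """
    n = int(round(360 / spacing))          # N gridpoints per axis
    counts = Counter()
    for phi, psi in pairs:
        # Assumed convention: index 0 starts at -180 degrees; +180 wraps to 0.
        i = int((phi + 180.0) // spacing) % n
        j = int((psi + 180.0) // spacing) % n
        counts[(i, j)] += 1
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}

def coverage(grid, spacing):
    """Fraction of the N*N gridpoints that hold at least one sample."""
    n = int(round(360 / spacing))
    return len(grid) / (n * n)
```

As a check on the arithmetic in the text, a $30^\circ$ grid has $12^2 = 144$ gridpoints and a $5^\circ$ grid has $72^2 = 5184$, so 110 and 1114 populated gridpoints correspond to $110/144 \approx 76.4\%$ and $1114/5184 \approx 21.5\%$.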
The number of each type of residue found in the two datasets is shown in Table . The U121 dataset contains nearly three times as many residues as the H64 dataset. Although a larger sample size is advantageous in statistical analyses, this advantage is offset for the U121 dataset by the inclusion of low-quality structures. This problem is made clear in Table , where the number of non-zero gridpoints is listed for the three residue types at various grid spacings. The inclusion of data from all structures in the U121 dataset greatly increases the number of populated gridpoints. This is the case for all three residue types at all five spacing levels, but it is particularly notable at grid spacings of $15^\circ$ and less. Clearly, far more areas of conformational space have at least one representative in the U121 dataset. However, it is difficult to say whether this is due to the larger sample size or reflects the fact that low-resolution structures are included in the U121 data. Unusual conformations in these low-resolution structures may be due to poor crystallographic data and might even be a cause of bad fits to data (high R-factors). A more interesting analysis is the number of high-probability gridpoints (), as shown in Table . Because of the large number of gridpoints with , the percentage having is substantially less than 50%. This number is very consistent across different grid spacings and is far more consistent between the datasets than the number of populated gridpoints. This indicates that the U121 dataset contains a large number of very rare conformations, and it should not be detrimental to exclude them from the probability grids used for our simulations. This is especially true for the standard residues and for the larger grid spacings of glycine and proline. For the ultrafine $5^\circ$ grids, there is clearly insufficient data for proline and glycine conformations. 
The sample sizes for glycine and proline are smaller than the number of $5^\circ$ gridpoints, so every nonzero gridpoint automatically has greater than . This problem is particularly acute for the H64 dataset, where the percentage of high-probability conformations drops off dramatically at $5^\circ$. This dataset is probably inadequate for glycine and proline conformation sampling at a $5^\circ$ resolution.
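The comparison of populated and high-probability gridpoints can be sketched as follows. The function name and grid representation are hypothetical, and the cutoff is deliberately left as a required argument because the example does not assume any particular value for it.

```python
def grid_stats(grid, spacing, threshold):
    """Count populated and high-probability gridpoints on an N x N grid.

    grid      : dict mapping (i, j) gridpoint indices to probabilities
    spacing   : grid spacing S in degrees
    threshold : probability cutoff for "high-probability" (caller-supplied;
                no default is assumed here)
    """
    n = int(round(360 / spacing))
    total = n * n
    populated = sum(1 for p in grid.values() if p > 0)
    high = sum(1 for p in grid.values() if p > threshold)
    return {
        "total": total,
        "populated": populated,
        "high": high,
        "pct_populated": 100.0 * populated / total,
        "pct_high": 100.0 * high / total,
    }
```

Purely for illustration, suppose the cutoff were the uniform probability $1/N^2$. A dataset with $n$ samples assigns every populated gridpoint a probability of at least $1/n$, so whenever $n < N^2$ every populated gridpoint clears that cutoff, and the high-probability count stops carrying information beyond the populated count. This is the kind of small-sample effect described above for glycine and proline on the $5^\circ$ grid.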
Table confirms what can be seen in Figure : the grids are substantially different for the three residue types. Glycine is clearly more flexible, having a much larger number of high-probability conformations. Proline, in contrast, is far less flexible and has far fewer high-probability conformations, as would be expected from geometrical considerations. The closed ring formed by its backbone and sidechain severely restricts the $\phi$ angle to values near $-60^\circ$. The highest probability peak for each type of residue is shown in Table . For standard residues, the alpha-helical peak predominates. At every spacing level, the alpha-helical conformation is the highest peak, even though the probability of picking the peak gridpoint decreases as the total number of gridpoints increases. The intra-strand hydrogen bonding of alpha helices greatly favors conformations near (). Therefore, the peak is very sharp, as becomes increasingly clear for the finer grids in Figure . In contrast, the beta-sheet region of the grid, centered about (), is much broader. No individual gridpoint in the beta-sheet region is as high as the alpha-helical peak, even though the beta-sheet quadrant (I) has nearly the same overall probability as the alpha-helix quadrant (II) (47.8% vs. 49.4%; see Table ). Proline grids have two sharp peaks, as is seen for the $30^\circ$ grid in Figure . The two peaks are so similar that the identity of the highest peak depends on both the grid spacing and the dataset. There is little probability of proline conformations outside of the two peak regions; there is almost no chance that the conformation is in quadrant III or IV. The opposite is true for the third major residue type, glycine. Glycine's great flexibility is clearly seen in Table . The four quadrants are almost equally populated, since there is no sidechain to sterically hinder quadrant III and IV conformations. Because of this flexibility, no single peak has a particularly high probability (Table ).
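The quadrant populations and peak locations discussed above can be computed from a probability grid along the following lines. This is a sketch under assumptions: the text identifies quadrant I with the beta-sheet region and quadrant II with the alpha-helix region, which is taken here to mean $\phi<0,\psi>0$ and $\phi<0,\psi<0$ respectively; the assignment of III to $\phi>0,\psi<0$ and IV to $\phi>0,\psi>0$ is a guess made for the example, as are the function names and the convention that index 0 lies at $-180^\circ$.

```python
def quadrant_populations(grid, spacing):
    """Sum grid probabilities over the four (phi, psi) quadrants.

    grid maps (i, j) gridpoint indices to probabilities; N = 360/S is
    assumed to be even so that the axes split cleanly at 0 degrees.
    """
    n = int(round(360 / spacing))
    half = n // 2                      # indices below `half` are negative angles
    totals = {"I": 0.0, "II": 0.0, "III": 0.0, "IV": 0.0}
    for (i, j), p in grid.items():
        phi_neg, psi_neg = i < half, j < half
        if phi_neg and not psi_neg:
            totals["I"] += p           # beta-sheet quadrant (from the text)
        elif phi_neg and psi_neg:
            totals["II"] += p          # alpha-helix quadrant (from the text)
        elif psi_neg:
            totals["III"] += p         # assumed labeling
        else:
            totals["IV"] += p          # assumed labeling
    return totals

def highest_peak(grid, spacing):
    """Return ((phi, psi), probability) of the most probable gridpoint."""
    (i, j), p = max(grid.items(), key=lambda kv: kv[1])
    return (-180.0 + i * spacing, -180.0 + j * spacing), p
```

Comparing `highest_peak` across grid spacings shows why a sharp peak can dominate even when its quadrant total is no larger than that of a broad region: the quadrant sum is independent of $S$, while the peak probability depends on how much of the distribution falls in a single gridpoint.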
We have also used the secondary structure designators in the protein database (HELIX, SHEET, and TURN) to obtain separate probability grids for alpha helix, beta sheet, and coil regions. We decided not to create grids for beta turn residues because the four residues involved in a turn usually have completely different conformations, and it would be counterproductive to treat them identically. Presumably, eight-dimensional probability grids generated for sequences of four consecutive $(\phi,\psi)$ pairs would have peaks for particular turn conformations as well, but the total number of turns in our set of crystal structures is tiny compared to the immense number of gridpoints on an eight-dimensional grid. Such grids would have little advantage over a method which simply tries all known turn configurations. We do have separate probability grids for coil residues, however. We define coil residues as all those not involved in any of the three major secondary structure types. Six proteins in the H64 dataset had no HELIX, SHEET, or TURN designators, and we excluded these from the secondary structure analyses because we did not want to assume a complete lack of secondary structural elements for these proteins. The remaining 58 proteins with secondary structure designators comprise the SS58 dataset, which we used to create the probability grids shown in Figure . Table lists the total number of samples of each residue type for each structural class. While the coil population is large for all residue types, it is particularly high for proline residues. The backbone nitrogen of proline is bonded to the C$^{\delta}$ of the sidechain, so it is not available for hydrogen bond formation. Prolines therefore cannot act as backbone hydrogen-bond donors in the hydrogen bonds which stabilize helices, sheets, and turns. The coil grid in Figure contains significant probabilities for both helix and sheet conformations, but the probabilities are much lower than those in the ``all-structures'' grid. 
Presumably, residues in the coil regions neither participate in the extended hydrogen-bonding networks nor are involved in the large-scale dipole-dipole interactions of helices and sheets. The coil probability grids are therefore more indicative of the inherent conformational energies of individual residues, and they are the grids which most closely resemble classic Ramachandran plots and potential energy maps. These secondary structure-specific grids are useful only when the secondary structure is known beforehand. This is not the case for an ab initio prediction of protein conformation, but it is the case for simulations used in conjunction with C$^{\alpha}$ coordinates, homology modeling, or secondary structure prediction algorithms.
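The construction of separate grids per residue type and structural class can be sketched as below. The input format (a list of tuples with a three-letter residue name, a class label, and the two angles in degrees), the label strings, and the function names are hypothetical; parsing of the HELIX, SHEET, and TURN designators is assumed to have been done beforehand. Turn residues are skipped, and residues with none of the three designators are expected to arrive labeled as coil, following the definitions in the text.

```python
from collections import Counter, defaultdict

def residue_type(resname):
    """Map a three-letter residue name onto the three major residue types."""
    if resname == "GLY":
        return "glycine"
    if resname == "PRO":
        return "proline"
    return "standard"

def build_ss_grids(residues, spacing):
    """Build one normalized (phi, psi) grid per (residue type, structural class).

    residues : iterable of (resname, ss_class, phi, psi), with ss_class one of
               "helix", "sheet", "turn", "coil" (hypothetical labels)
    """
    n = int(round(360 / spacing))
    counts = defaultdict(Counter)
    for resname, ss_class, phi, psi in residues:
        if ss_class == "turn":
            continue                   # no grids are built for turn residues
        # Assumed convention: index 0 starts at -180 degrees.
        i = int((phi + 180.0) // spacing) % n
        j = int((psi + 180.0) // spacing) % n
        counts[(residue_type(resname), ss_class)][(i, j)] += 1
    grids = {}
    for key, ctr in counts.items():
        total = sum(ctr.values())
        grids[key] = {b: c / total for b, c in ctr.items()}
    return grids
```

Each grid is normalized separately, so probabilities in, for example, the proline coil grid sum to one regardless of how many proline residues fall in helices or sheets.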