s4.database.
The dihedral probabilities which are integral to our method require a
judicious choice of structural data. Therefore, we sought to use a
subset of protein structures from the Brookhaven Protein Data Bank
(PDB) that was both diverse and accurate. The Brookhaven PDB
contains more than 500 protein crystal structures, even excluding
structures with only Cα coordinates. However, there are many
proteins which are represented numerous times or are highly homologous
to other proteins in the PDB dataset. Such identical, or nearly
identical, structures would tend to distort our probabilities in favor
of geometries found in those particular proteins. In order to
eliminate highly redundant structures, we carried out pairwise
sequence comparisons among 503 proteins in our initial PDB dataset,
using the ``align'' program from W.R. Pearson's FASTA sequence analysis
package[65]. Any protein with greater than 25% sequence
identity to another protein of higher resolution was eliminated.
This homology-elimination process reduced our dataset from 503
proteins to 121. This dataset of
121 proteins, which we call U121, is useful for a wide variety
of statistical analyses. However, geometric analyses such as those
performed here require high-resolution data, so we further reduced the
dataset to 64 crystal structures that had a resolution of 1.5 Å or
better, or had a resolution better than 2.0 Å and R-factors below
20%. This dataset, which we call H64, was used to create
our probability grids. The 64 crystal structures comprising this
dataset are listed in Table
.
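
The two selection steps described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the procedure actually used: the `Structure` record, the function names, the structure identifiers, and the toy identity values are all hypothetical, and the pairwise percent identities are assumed to have been computed beforehand (for example with an alignment program). The text does not say whether a protein was compared against every other protein or only against those already retained; the sketch takes the literal reading and compares against every higher-resolution protein in the initial set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Structure:
    pdb_id: str        # hypothetical identifier, for illustration only
    resolution: float  # in Angstroms; a smaller value is a higher resolution
    r_factor: float    # as a fraction, e.g. 0.18 for 18%


def remove_redundant(structures, identity, cutoff=25.0):
    """Drop any structure with more than `cutoff` percent sequence identity
    to another structure of higher resolution (assumed literal reading)."""
    kept = []
    for s in structures:
        redundant = any(
            other.resolution < s.resolution
            and identity(s.pdb_id, other.pdb_id) > cutoff
            for other in structures
            if other is not s
        )
        if not redundant:
            kept.append(s)
    return kept


def is_high_resolution(s):
    """1.5 A or better, or better than 2.0 A with an R-factor below 20%."""
    return s.resolution <= 1.5 or (s.resolution < 2.0 and s.r_factor < 0.20)


# Toy data: every value below is made up for illustration.
structures = [
    Structure("A", 1.2, 0.18),
    Structure("B", 1.8, 0.19),
    Structure("C", 1.9, 0.25),
    Structure("D", 2.5, 0.17),
    Structure("E", 1.8, 0.17),
]

# Assumed precomputed percent identities; unlisted pairs default to 10%.
pair_identity = {frozenset(("A", "B")): 90.0}


def identity(id_a, id_b):
    return pair_identity.get(frozenset((id_a, id_b)), 10.0)


nonredundant = remove_redundant(structures, identity)
high_resolution = [s for s in nonredundant if is_high_resolution(s)]

print([s.pdb_id for s in nonredundant])
print([s.pdb_id for s in high_resolution])
```

In this toy example, B is removed because it is 90% identical to the higher-resolution A, while C and D are removed at the second step by the resolution and R-factor criteria. Two homologous structures with exactly equal resolution would both be retained by this sketch; a real pipeline would need a tie-breaking rule.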