
The Protein Database Subset


The dihedral probabilities that are integral to our method require a judicious choice of structural data. We therefore sought a subset of protein structures from the Brookhaven Protein Database (PDB) that was both diverse and accurate. The Brookhaven PDB contains more than 500 protein crystal structures, even excluding structures with only Cα coordinates. However, many proteins are represented numerous times or are highly homologous to other proteins in the PDB dataset. Such identical, or nearly identical, structures would tend to bias our probabilities toward geometries found in those particular proteins. To eliminate highly redundant structures, we carried out pairwise sequence comparisons among the 503 proteins in our initial PDB dataset, using the "align" program from W.R. Pearson's FASTA sequence analysis package[65]. Any protein with greater than 25% sequence identity with another protein of higher resolution was eliminated. This homology-elimination process reduced our dataset from 503 proteins to 121. This dataset of 121 proteins, which we call U121, is useful for a wide variety of statistical analyses. However, geometric analyses such as those needed here require high-resolution data, so we further reduced the dataset to 64 crystal structures that either had 1.5 Å resolution data or better, or had better than 2.0 Å resolution and R-factors below 20%. This dataset, which we call H64, was used to create our probability grids. The 64 crystal structures comprising this dataset are listed in Table .
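The two filtering steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used to build U121 and H64: the `Structure` record, the function names, and the toy entries are hypothetical, and the pairwise identity is supplied by the caller (in the text it comes from the "align" program). The sketch reads the elimination rule literally, so a protein is dropped whenever any other protein in the initial set has strictly better resolution and more than 25% identity to it; how ties in resolution were handled is not stated in the text and is left unresolved here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Structure:
    """Hypothetical record for one crystal structure."""
    pdb_id: str
    resolution: float  # in angstroms; lower means better
    r_factor: float    # as a fraction, e.g. 0.18 for 18%


def remove_redundant(
    structures: List[Structure],
    identity: Callable[[Structure, Structure], float],
    max_identity: float = 25.0,
) -> List[Structure]:
    """Drop any structure with > max_identity percent sequence identity
    to another structure of strictly better (lower) resolution."""
    kept = []
    for p in structures:
        redundant = any(
            q.resolution < p.resolution and identity(p, q) > max_identity
            for q in structures
            if q is not p
        )
        if not redundant:
            kept.append(p)
    return kept


def is_high_resolution(s: Structure) -> bool:
    """1.5 A or better, or better than 2.0 A with an R-factor below 20%."""
    return s.resolution <= 1.5 or (s.resolution < 2.0 and s.r_factor < 0.20)


# Toy data with made-up identifiers and values (assumptions, not real entries).
toy = [
    Structure("aaaa", 1.4, 0.19),
    Structure("bbbb", 1.8, 0.18),  # homologous to "aaaa", lower quality
    Structure("cccc", 1.9, 0.17),
    Structure("dddd", 1.9, 0.22),
    Structure("eeee", 2.5, 0.15),
]

# Percent identity for selected pairs; all other pairs default to 10%.
pair_identity = {frozenset(("aaaa", "bbbb")): 90.0}


def toy_identity(p: Structure, q: Structure) -> float:
    return pair_identity.get(frozenset((p.pdb_id, q.pdb_id)), 10.0)


nonredundant = remove_redundant(toy, toy_identity)             # analogue of U121
high_res = [s for s in nonredundant if is_high_resolution(s)]  # analogue of H64
```

In this toy run the homologous, lower-resolution entry is removed in the first step, and the second step then keeps only the entries that meet one of the two resolution criteria.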
Sat Jun 18 14:06:11 PDT 1994