The global conformation of a protein can be well approximated by a
trace drawn through the coordinates of its Cα atoms. As the central
atom of each amino acid residue (the point at which the sidechain
branches off from the main chain), the Cα atom is the best choice
to represent the amino acid as a whole. The accompanying figure shows the
Cα trace of the small protein crambin, as well as a picture of the
backbone atoms and a picture of all atoms in the structure, from the
crystal structure by Hendrickson and Teeter[77] (Brookhaven
Protein Data Bank (PDB) structure 1CRN). Because of their central
location, Cα coordinates usually form the starting point for the
process of building a protein model from X-ray crystallographic
data[78]. In addition, purely theoretical schemes to
predict tertiary structure often use a simplified protein model
containing only Cα coordinates[80][79]. Cα coordinates can also
form a template for homology-based molecular
modeling[81]. However, the Cα coordinates do not provide
sufficient information for understanding the most critical aspects of
proteins such as binding and catalysis, which are determined by the
chemical and steric properties of the protein backbone and sidechains.
It is therefore necessary to provide a means of obtaining all atomic
coordinates for proteins when only the Cα coordinates are known.
Several methods for modeling complete protein structures from Cα
coordinates have been published in recent
years[2,6-10]. The primary purpose of such methods is to speed up and
automate the process of building a protein model from crystallographic
data[78], but several other uses have been suggested. Holm
and Sander[85] describe how correct and incorrect protein
folds can be evaluated by such a method, while Rey and Skolnick
mention that their procedure may enable complete protein structures to
be built from the Cα coordinates of a lattice
representation[85]. The work reported here has been
motivated by both of these factors: the desire to build full protein
structures from lattice structures, and to provide a means for
evaluating different lattice conformations. In addition, we have
found that the "Cα Builder" described here has been useful for
homology modeling, as it allowed us to build a model of Hin
recombinase from the Cα coordinates of Cro[81].
The process of building full protein conformations from Cα
coordinates requires success in two areas: prediction of backbone
conformations in the presence of explicit geometric constraints (the
known Cα coordinates) and prediction of sidechain conformations
constrained only by the conformation of the backbone and the presence
of other sidechains. Our method provides a consistent approach to
solving the two problems. Based primarily on Monte Carlo
conformational searching, our technique differs significantly from
previously published techniques, which range from the purely
geometric[86][82] to methods based primarily on
database searches of several consecutive
residues[85][78][83] or molecular
mechanics[84].
Our procedure for building protein structures from Cα coordinates
uses the conformational probabilities of individual residues, rather
than groups of residues and, therefore, does not depend upon the prior
existence of particular conformations in the protein database. The
process uses the Probability Grid Monte Carlo (PGMC) method to build
first the backbone conformation and then the sidechains. The
PGMC method, described fully in Chapter 4, modifies protein
conformations one residue at a time by choosing either new backbone
(φ, ψ) or new sidechain (χ) dihedral angles from probability matrices.
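
As a rough illustration of choosing dihedral angles from a probability matrix, the following Python sketch draws one backbone angle pair from a two-dimensional grid. The grid layout, the bin width, and the function name `sample_from_grid` are assumptions made for this example only; the actual PGMC grids and their resolution are those described in Chapter 4.

```python
import random


def sample_from_grid(grid, bin_width, rng):
    """Draw one (phi, psi) pair, in degrees, from a 2-D probability grid.

    Assumed layout (hypothetical, for illustration): grid[i][j] is the
    unnormalized weight of the bin whose lower corner lies at
    phi = -180 + i * bin_width and psi = -180 + j * bin_width.
    """
    # Flatten the grid into a list of bins and a matching list of weights.
    bins = [(i, j) for i, row in enumerate(grid) for j in range(len(row))]
    weights = [grid[i][j] for i, j in bins]

    # random.choices normalizes the weights internally, so bins with a
    # larger weight are chosen proportionally more often.
    i, j = rng.choices(bins, weights=weights, k=1)[0]

    # Place the angle uniformly inside the chosen bin.
    phi = -180.0 + (i + rng.random()) * bin_width
    psi = -180.0 + (j + rng.random()) * bin_width
    return phi, psi
```

A sidechain move could be sketched in the same way by indexing the grid with sidechain dihedral angles instead of backbone ones.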
In the first phase of the PGMC Cα Builder, the backbone is built
one residue at a time. As the protein
chain grows, the conformational space of the backbone is sampled by
the PGMC method using φ-ψ probability grids. The DREIDING
forcefield[87] is used to evaluate the energy of each
structure, with harmonic constraint terms added between the template
Cα coordinates and the Cα coordinates of the growing chain.
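
The constraint contribution can be sketched as follows. This is a minimal example that assumes the common form E = (k/2) d^2 for each template/model atom pair; the force constant value and the function name `ca_restraint_energy` are placeholders, not values taken from this work. Because the chain is still growing, only the residues built so far are compared with the template.

```python
def ca_restraint_energy(template, model, k=100.0):
    """Harmonic restraint energy between template and model C-alpha atoms.

    template, model: sequences of (x, y, z) tuples in angstroms. The model
    may be shorter than the template while the chain is being grown.
    k: force constant (assumed placeholder value).
    """
    if len(model) > len(template):
        raise ValueError("model has more residues than the template")

    energy = 0.0
    # zip stops at the end of the shorter sequence, so only the residues
    # already present in the growing chain are restrained.
    for (x0, y0, z0), (x, y, z) in zip(template, model):
        d2 = (x - x0) ** 2 + (y - y0) ** 2 + (z - z0) ** 2
        energy += 0.5 * k * d2
    return energy
```

In such a scheme this term would simply be added to the forcefield energy of each trial structure before the move is accepted or rejected.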
After the entire backbone is built in this way, sidechain positions
are optimized during a second PGMC simulation. This second simulation
uses χ probability grids to modify one sidechain conformation at
a time. Because the PGMC method uses random numbers both to determine
whether new conformations are accepted or rejected and to choose new
conformations, each run produces different results. Therefore, it is
general practice to generate numerous backbone conformations and select
those with the lowest energy for use in the second stage. Likewise, for each
backbone conformation, several Monte Carlo simulations are run to
optimize the sidechains, and the structure with the lowest overall
energy is selected as the optimized model.
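
The run-and-select strategy can be illustrated with a short Python sketch. It assumes the standard Metropolis acceptance rule, which the text above does not name explicitly, and it takes the energy function and the move generator as arguments; all function names and parameters here are hypothetical.

```python
import math
import random


def metropolis_accept(e_old, e_new, kT, rng):
    """Accept downhill moves always, uphill moves with Boltzmann probability."""
    if e_new <= e_old:
        return True
    return rng.random() < math.exp(-(e_new - e_old) / kT)


def run_search(energy, propose, start, n_steps, kT, rng):
    """One Monte Carlo run; returns the lowest-energy state it visited."""
    current, e_cur = start, energy(start)
    best, e_best = current, e_cur
    for _ in range(n_steps):
        trial = propose(current, rng)      # random choice of a new state
        e_trial = energy(trial)
        if metropolis_accept(e_cur, e_trial, kT, rng):
            current, e_cur = trial, e_trial
            if e_cur < e_best:
                best, e_best = current, e_cur
    return best, e_best


def best_of_runs(energy, propose, start, n_runs, n_steps, kT, seed=0):
    """Repeat the search with different random seeds and keep the best result."""
    results = [
        run_search(energy, propose, start, n_steps, kT, random.Random(seed + r))
        for r in range(n_runs)
    ]
    return min(results, key=lambda result: result[1])
```

Because each run uses its own random number stream, the runs explore different paths, and selecting the minimum over runs mirrors the choice of the best-energy structure described above.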