The global conformation of a protein can be well approximated by a trace drawn through the coordinates of its Cα atoms. As the central atom of each amino acid residue, the point at which the sidechain branches off from the main chain, the Cα atom is the best choice to represent the residue as a whole. The figure shows the Cα trace of the small protein crambin, together with a picture of the backbone atoms and a picture of all atoms in the structure, taken from the crystal structure by Hendrickson and Teeter (Brookhaven Protein Data Bank (PDB) structure 1CRN). Because of their central location, Cα coordinates usually form the starting point for building a protein model from X-ray crystallographic data. In addition, purely theoretical schemes for predicting tertiary structure often use a simplified protein model containing only Cα coordinates, and Cα coordinates can serve as a template for homology-based molecular modeling. However, Cα coordinates alone do not provide sufficient information for understanding the most critical aspects of protein function, such as binding and catalysis, which are determined by the chemical and steric properties of the protein backbone and sidechains. It is therefore necessary to have a means of obtaining all atomic coordinates of a protein when only the Cα coordinates are known.
Several methods for modeling complete protein structures from Cα coordinates have been published in recent years [2,6-10]. The primary purpose of such methods is to speed up and automate the process of building a protein model from crystallographic data, but several other uses have been suggested. Holm and Sander describe how such a method can be used to evaluate correct and incorrect protein folds, while Rey and Skolnick mention that their procedure may enable complete protein structures to be built from the Cα coordinates of a lattice representation. The work reported here was motivated by both of these uses: the desire to build full protein structures from lattice structures, and the need for a means of evaluating different lattice conformations. In addition, we have found the "Cα Builder" described here useful for homology modeling, as it allowed us to build a model of Hin recombinase from the Cα coordinates of Cro.
Building full protein conformations from Cα coordinates requires success in two areas: predicting backbone conformations in the presence of explicit geometric constraints (the known Cα coordinates), and predicting sidechain conformations constrained only by the conformation of the backbone and the presence of other sidechains. Our method provides a consistent approach to both problems. Based primarily on Monte Carlo conformational searching, it differs significantly from previously published techniques, which range from purely geometric constructions to methods based primarily on database searches of several consecutive residues or on molecular mechanics.
Our procedure for building protein structures from Cα coordinates uses the conformational probabilities of individual residues rather than of groups of residues, and therefore does not depend on the prior existence of particular conformations in the protein database. The process uses the Probability Grid Monte Carlo (PGMC) method to build first the backbone conformation and then the sidechains. The PGMC method, described fully in Chapter 4, modifies protein conformations one residue at a time by choosing new backbone (φ, ψ) or sidechain (χ) dihedral angles from probability matrices. In the first phase of the PGMC Cα Builder, the backbone is built one residue at a time. As the protein chain grows, the conformational space of the backbone is sampled by the PGMC method using probability grids. The DREIDING forcefield is used to evaluate the energy of each structure, with additional harmonic constraint terms added between the template Cα coordinates and the Cα coordinates of the growing chain. After the entire backbone has been built in this way, sidechain positions are optimized in a second PGMC simulation, which uses probability grids to modify one sidechain conformation at a time. Because the PGMC method uses random numbers both to choose new conformations and to determine whether they are accepted or rejected, each run produces different results. It is therefore general practice to generate numerous backbone conformations and to select those with the best energy for use in the second stage. Likewise, for each backbone conformation, several Monte Carlo simulations are run to optimize the sidechains, and the structure with the best overall energy is selected as the optimized model.
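The overall control flow described above can be illustrated with a minimal Python sketch. It is not the PGMC implementation of Chapter 4: all function names, the force constant `k`, and the temperature factor `kT` are hypothetical, a plain Metropolis acceptance test is assumed (the actual PGMC acceptance rule may differ), and the chain-growth step, the conversion of dihedral angles to Cartesian coordinates, and the DREIDING energy are omitted. The caller supplies `energy_fn`, which in a real builder would combine the forcefield energy with the harmonic Cα restraint.

```python
import math
import random


def harmonic_restraint(built, template, k=100.0):
    """Sum of 0.5 * k * d^2 over paired C-alpha positions (k is an assumed value)."""
    return sum(0.5 * k * math.dist(a, b) ** 2 for a, b in zip(built, template))


def sample_from_grid(grid, rng):
    """Draw one dihedral bin from a probability grid given as {bin: probability}."""
    bins = list(grid)
    weights = [grid[b] for b in bins]
    return rng.choices(bins, weights=weights, k=1)[0]


def metropolis_accept(e_old, e_new, kT, rng):
    """Assumed acceptance rule: always accept downhill, uphill with Boltzmann odds."""
    if e_new <= e_old:
        return True
    return rng.random() < math.exp(-(e_new - e_old) / kT)


def pgmc_run(n_residues, grid, energy_fn, steps, kT, rng):
    """One Monte Carlo run that changes a single residue per step."""
    conf = [sample_from_grid(grid, rng) for _ in range(n_residues)]
    energy = energy_fn(conf)
    best_conf, best_energy = list(conf), energy
    for _ in range(steps):
        i = rng.randrange(n_residues)           # pick one residue to modify
        trial = list(conf)
        trial[i] = sample_from_grid(grid, rng)  # new angles from the grid
        e_trial = energy_fn(trial)
        if metropolis_accept(energy, e_trial, kT, rng):
            conf, energy = trial, e_trial
            if energy < best_energy:
                best_conf, best_energy = list(conf), energy
    return best_conf, best_energy


def best_of_runs(n_runs, seed, **kwargs):
    """Repeat independent runs and keep the lowest-energy result."""
    results = [pgmc_run(rng=random.Random(seed + r), **kwargs) for r in range(n_runs)]
    return min(results, key=lambda result: result[1])
```

In this sketch the same `best_of_runs` driver could serve both stages: first with a backbone (φ, ψ) grid and a restrained energy function, then with a sidechain (χ) grid for each retained backbone. Seeding each run separately keeps the runs independent but reproducible.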