Fred E. Cohen
Analysis and Prediction of Protein Structure and Protein Ligand Interactions
The amino acid sequence of a protein codes for its three-dimensional structure. Computational methods are being developed to predict structure from sequence information. In general, these efforts have not succeeded. I will describe our work on de novo protein structure prediction and model building the structures of pharmaceutically interesting proteins by homology to proteins of known structure. Examples of the utility of these model-built structures in drug discovery will be presented. Finally, I will describe some of our more basic work toward understanding the origins of protein stability and the dynamic motions of polypeptide chains and our work on prion protein structures.
DE NOVO PROTEIN STRUCTURE PREDICTION
In an attempt to simplify the general problem of protein folding, we have adopted a hierarchical approach. First, identify the location of secondary structure elements. Then, pack these pre-formed secondary structures to form an approximate tertiary structure. Finally, regularize and optimize the approximate tertiary structure to yield a detailed view of the folded protein.
SECONDARY STRUCTURE ELEMENTS AND THEIR ORIGINS
We have continued our work on pattern-based secondary structure prediction, focusing on the class of proteins rich in α-helical structure. Building upon the success of our turn prediction algorithm for all-helical proteins, we were able to develop patterns that correctly recognized the cores of α-helices with a 95% success rate. Unfortunately, the N- and C-cap features proved to be more difficult to specify (56% and 48% success rates, respectively). This information could be combined to produce an accuracy of 71% on a residue-by-residue basis and a 78% success rate if one focuses on the recognition of core helical features. This is comparable to the accuracy of our neural network algorithm. We have been pleased to note that several of the capping patterns that we identified from our analysis of helical structures correctly anticipated many of the results of more detailed studies of helical caps by Rose and colleagues. However, we remain disappointed that helical capping structures seem to be relatively weak objects that are often dispensable from the global energetic perspective of the folding chain. In an effort to make our LISP-based, class-dependent secondary structure prediction algorithms more available to the general biochemistry community, we have developed a Macintosh version of our software that is available upon request.
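As an illustration of what a residue-by-residue accuracy such as the 71% figure above measures, the following is a minimal sketch. The function name, the one-letter state codes (H for helix, C for everything else) and the toy strings are assumptions made for this example; they are not taken from our prediction software.

```python
def per_residue_accuracy(predicted, observed):
    """Fraction of positions at which the predicted state matches the
    observed state. Both arguments are equal-length strings with one
    state code per residue (here 'H' = helix, 'C' = other)."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("state strings must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Toy example, invented for illustration: one residue at a helix
# C-terminus is over-predicted as helical.
predicted = "HHHHCCHHHH"
observed = "HHHCCCHHHH"
accuracy = per_residue_accuracy(predicted, observed)
```

A score of this kind weights every residue equally, which is why poorly specified N- and C-caps lower the overall figure even when the helix cores are recognized reliably.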
While a great deal of effort has been focused on the structures of α-helices and β-sheets and their sequential preferences, much less effort has been directed toward loops other than β-turns. We completed an extensive taxonomic survey of these aperiodic structures that suggested a general classification system for loops. Out of 432 loops (4 to 20 residues in length) extracted from 67 proteins, 205 were classified as linear (straps), 133 as non-linear and planar (Ω's) and 86 as non-linear and non-planar (ζ's). The remaining 8 were classified as compound loops because they contained a combination of strap, Ω, and ζ morphologies. This simple geometric classification strategy led to a structural alphabet for tetrapeptides within a loop that was useful in describing local conformation. This structural alphabet, together with the sequence-dependent statistical preferences for these conformations, provided the groundwork for the development of a genetic algorithm that can generate loop structures subject to a variety of experimental and theoretical constraints. This algorithm is robust for loops as long as 20 residues, is efficient, and requires much less computer memory than a loop dictionary method. A comparison of the genetic algorithm (BLoop) with a loop dictionary approach for model building suggests that these methods yield equivalent results. Unlike loop dictionary approaches, which encounter sampling problems for longer loops, BLoop can easily generate a large number of alternative conformations for these longer loops. We plan to exploit this tool and its successors in our modeling efforts on G-protein coupled receptors (M. Bower).
We remain concerned by the question, why is it so difficult to improve secondary structure prediction? One obvious point of consternation is the result of Kabsch and Sander that identical pentapeptide sequences chosen from parts of different proteins could adopt radically distinct conformations (e.g., α vs. β). Given the increase in the size of the structural database over the past decade, we anticipated that a number of hexapeptides could be found that would extend the Kabsch-Sander paradox. Within a set of proteins with less than 50% sequence identity, 59 pairs of identical hexapeptides were identified. These local structures were compared and their surrounding structural environments examined. We were surprised to find that within a protein structural class (α/α, β/β, α/β, α+β) the structural similarity of identical hexapeptides usually is preserved. None of the eight examples of identical hexapeptides that formed α-helices in one protein and β-structure in another came from the same folding class. This work suggests that context-dependent features introduced by more global properties of the domain can, in principle, be used to correctly predict the structures of these conformationally plastic sequences (B. Cohen).
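The first step of the hexapeptide survey described above, finding identical hexapeptides that occur in different proteins, can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the function name `shared_kmers` and the toy sequences are hypothetical, and both the selection of proteins with less than 50% sequence identity and the comparison of the resulting local structures are omitted.

```python
from collections import defaultdict
from itertools import combinations


def shared_kmers(sequences, k=6):
    """Return {kmer: [(name_a, pos_a, name_b, pos_b), ...]} for every
    k-residue peptide found in at least two different sequences.
    `sequences` maps a protein name to its one-letter sequence;
    positions are 0-based start indices."""
    index = defaultdict(list)  # kmer -> list of (protein name, start index)
    for name, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))

    pairs = {}
    for kmer, hits in index.items():
        # Keep only occurrences in two different proteins; repeats
        # within a single protein are ignored.
        cross = [(a, i, b, j)
                 for (a, i), (b, j) in combinations(hits, 2)
                 if a != b]
        if cross:
            pairs[kmer] = cross
    return pairs


# Toy sequences, invented for illustration only.
toy = {
    "protA": "MKVLAAGIVGLSTE",
    "protB": "GSVLAAGIPQRW",
}
result = shared_kmers(toy)
```

Each pair reported in this way would then be looked up in the corresponding structures so that the local conformations and their surrounding environments can be compared.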
These results enhance the importance of understanding the balance between local and global effects on the stability of secondary structure elements. Dill and colleagues had argued from their studies on compact polypeptide chains confined to a cubic lattice that global effects were the dominant, if not exclusive, source of the stability for secondary structures. In an effort to verify this hypothesis in a more geometrically sensible setting, we studied the properties of an off-lattice one to three sphere per residue model of compact polypeptides. The conclusions of this work were somewhat at odds with the Dill hypothesis. We subsequently developed a more accurate representation of the polypeptide chain that included all non-hydrogen atoms explicitly. This work confirms that secondary structure arises from a balance of local (e.g., hydrogen bonding) and global (e.g., compactness arising from the hydrophobic effect) terms. Presumably, the cubic lattice was providing the simulated polypeptide with an implicit hydrogen bonding network. These results have led Dill and co-workers to modify their original stance on the origin of secondary structure and have led us in a promising new direction in the development of algorithms for secondary structure prediction (N. Hunt).
Current molecular mechanics force fields provide a detailed view of molecular motions. Unfortunately, their complexity, coupled with current limits on computer speed, prevents computational chemists from simulating protein processes that take longer than a nanosecond. The time required to simulate a molecular motion relates to the square of the number of atoms and the maximum size of a stable time step during numerical integration of the equations of motion. To speed up a simulation, one must find a faster computer or decrease the number of effective atoms and increase the stable time-step size. We are building an intermediate-resolution force field that approximates amino acids by a sphere for the backbone and a sphere for the side chain. Aromatic residues require two (or three) spheres for each side chain. A backbone potential that mimics the behavior of more realistic all-atom backbone representations has been developed. We are currently exploring a hydrogen-bonding potential (a challenge when there are no explicit amide nitrogens or carbonyl oxygens) and an implicit solvent-interaction term that avoids the need for the explicit inclusion of water molecules in the simulation. This remains an ambitious program, but preliminary calculations suggest that SPEEDY, a Simplified Potential for Energy Evaluation and DYnamics, runs approximately 100 times faster than current molecular mechanics packages, such as AMBER (J. Troyer).
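The scaling argument above (cost growing with the square of the number of atoms and shrinking with the size of the stable time step) can be made concrete with a short sketch. The cost model is a deliberate simplification, and the site counts and time steps in the example are illustrative assumptions only; they are not SPEEDY or AMBER parameters and are not meant to reproduce the roughly 100-fold figure quoted above.

```python
def relative_cost(n_sites, time_step):
    """Cost, in arbitrary units, of simulating a fixed stretch of time,
    assuming cost ~ (number of interaction sites)^2 / time step."""
    if n_sites <= 0 or time_step <= 0:
        raise ValueError("n_sites and time_step must be positive")
    return n_sites ** 2 / time_step


def speedup(n_fine, dt_fine, n_coarse, dt_coarse):
    """How many times cheaper the coarse model is than the fine one
    under the simple cost model above."""
    return relative_cost(n_fine, dt_fine) / relative_cost(n_coarse, dt_coarse)


# Hypothetical numbers: a 100-residue chain with 10 sites per residue
# versus 2 spheres per residue, and a time step twice as long.
example = speedup(n_fine=1000, dt_fine=1.0, n_coarse=200, dt_coarse=2.0)
```

Under these assumptions the five-fold reduction in sites contributes a factor of 25 and the longer time step a further factor of 2, which shows why reducing the number of effective atoms is the more powerful of the two levers.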
TERTIARY STRUCTURE ANALYSIS AND PREDICTION
In keeping with our interest in helical structures, we have devoted much of our effort to an analysis of helix packing and an application of these observations to proteins of biomedical relevance including human growth hormone, erythropoietin and interleukin-4. Four helix bundles are a common structural motif that can be observed both independently and as components of larger folding units. We examined 221 globular proteins of known structure for possible four helix bundles. Previous computational studies of four helix bundles have placed arbitrary restrictions on interhelical packing angles. In this study we developed a geometric definition of four helix bundles based in part on solvent accessibility criteria that permits the removal of constraints on interhelical packing angles. Based on the observed pattern of interhelical angles, a bundle taxonomy was presented. This formalism is already providing a useful categorization method for new structural studies of proteins rich in α-helices (S. Presnell, N. Harris).
The helix-helix interactions within bundles were studied in detail. Central residues (the residues in the middle of a helix-helix interaction), contact normals (the line segment joining helix axes at their point of closest approach), and skew angles (the deviation of the central residue from the contact normal that facilitates side chain interdigitation) all were observed to have non-random distributions. A simple geometric model was developed for the helix-helix interface to explain these findings. Analysis of the helix-helix interaction data collected in this work confirms the importance of including skew angles in models of helix packing and should improve the accuracy of combinatorial strategies for the prediction of the tertiary structure of all-helical proteins. Additionally, the geometric properties observed in globular proteins provide insight into the structural organization of membrane spanning proteins.
We developed a collaboration with Jim Wells at Genentech to try to blend computational models of helix packing with low resolution experimental data on antibody interaction sites to yield intermediate resolution models of the structure of human growth hormone without reference to the existing crystallographic data. Structural constraints derived from different antibody epitopes on human growth hormone (hGH) were used to screen three-dimensional models of hGH that were generated by computer algorithms. Previously, alanine-scanning mutagenesis defined the residues that modulate binding to 21 different monoclonal antibodies to hGH. These functional epitopes were composed of 4-14 side chains whose α-carbons clustered within 2-23 Å. Distance and topographic constraints for these functional epitopes were virtually the same as constraints derived from known x-ray structures of protein-antigen complexes. The constraints were used to evaluate about 1400 models of hGH that were computer-generated by a secondary-structure prediction and packing algorithm. On average each functional epitope reduced the number of models in the pool by a factor of 2, so that 8 monoclonal antibodies could reduce the number of possible models to <10. The average r.m.s. deviation of α-carbon coordinates between the x-ray structure and either the pool of starting models or final models ranged from 13 to 16 Å or 4 to 7 Å, respectively, depending on the pool of starting models and the level of constraints imposed. All of the final models had the correct folding topography and the best model was within 3.8 Å r.m.s. deviation of the x-ray coordinates. This model was as close as it could have been because the models were built by using ideal helices and those in the x-ray structure are not ideal. Our studies suggest that epitope mapping data can effectively screen structural models and, when coupled to predictive algorithms, can help to generate low-resolution models of a protein.
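The arithmetic behind the statement that eight antibodies suffice can be checked with a short sketch. It assumes, as a simplification, that each epitope removes the same fraction of the pool independently of the others; the function names are hypothetical, while the pool size (about 1400), the factor of 2 and the target of fewer than 10 models come from the text above.

```python
def expected_pool(n_models, n_epitopes, reduction=2.0):
    """Expected number of models left after applying n_epitopes
    constraints, each shrinking the pool by `reduction` on average."""
    return n_models / reduction ** n_epitopes


def epitopes_needed(n_models, target, reduction=2.0):
    """Smallest number of epitopes that brings the expected pool
    strictly below `target`."""
    if reduction <= 1.0:
        raise ValueError("reduction must be greater than 1")
    if target <= 0:
        raise ValueError("target must be positive")
    pool = float(n_models)
    count = 0
    while pool >= target:
        pool /= reduction
        count += 1
    return count


remaining = expected_pool(1400, 8)   # about 5.5 models
needed = epitopes_needed(1400, 10)   # 8 epitopes
```

Seven epitopes would leave about 11 models on average, so eight is the smallest number that meets the target under this assumption.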
In collaboration with Frank Bunn at Harvard, we developed a computational model of the erythropoietin (Epo) structure and critically examined the units of this model through the creation of a large number of site-directed mutants. Secondary structure prediction identified four helical regions (9-22, 59-76, 90-107, 132-152). A combinatorial packing algorithm explored 1.6 × 10^4 structures to identify 706 that were consistent with the connectivity of the chain and that were sterically sensible. Only 184 were compatible with the formation of a disulfide bridge between Cys7 and Cys161 and these structures resembled four helix bundles. The most likely of these structures is shown in Figure 1.
In order to test this model, site-directed mutants were prepared by high level transient expression in Cos7 cells and analyzed by a radioimmunoassay and by bioassays utilizing mouse and human Epo-dependent cell lines. Deletions of 5 to 8 residues within predicted α-helices resulted in the failure of export of the mutant protein from the cell. In contrast, deletions at the NH2 terminus (Δ163-166), or in predicted interhelical loops (AB: Δ32-36, Δ53-57; BC: Δ78-82; CD: Δ111-119) resulted in the export of immunologically detectable Epo muteins that were biologically active. The mutein Δ48-52 could be readily detected by radioimmunoassay but had markedly decreased biological activity. However, replacement of each of these deleted residues by serine resulted in Epo muteins with full biological activity. Replacement of Cys29 and Cys33 by tyrosine residues also resulted in the export of fully active Epo. Therefore, this small disulfide loop is not critical to Epo's stability or function. Having produced a list of predicted coordinates and distributed this model to a number of biologists interested in Epo structure-function relationships, we await the determination of this structure by NMR or x-ray crystallography so that we can learn how to improve our predictions.
Fig. 1: Model of the three-dimensional structure of erythropoietin. Ribbon diagram of the predicted Epo tertiary structure. The four α-helices are labeled A-D; loops between helices are named for the helices they interconnect. Two regions of extended structure which could form hydrogen bonds between Loop AB and Loop CD are also presented. N- and O-glycosylation sites are indicated by the darkened bars in loops AB, BC and CD. The disulfide bonds bridging residues 29-33 in Loop AB and residues 7-161 (on the NH2-terminal side of Helix A and the COOH-terminal side of Helix D) are not shown. N.B.: The loop tracing shown does not represent predicted coordinates.
We are in the process of completing our comparison of our predicted IL-4 structure and the NMR and x-ray structures of this molecule. Some of the comparison was the subject of a review for the FASEB Journal and is summarized in Figure 2.
Fig. 2: The accuracy of the prediction was assessed when the NMR structure of interleukin-4 became available. In general, the helices were accurately assigned and the topology was correctly predicted to be a four-helix bundle.
The secondary structure prediction was ~90% accurate. Of the ~100 plausible structures, the "best" seven from the standpoint of minimal solvent accessible surface area formed right-handed four helical bundles with two overhand connections. Structure 8 on the rank ordered list formed a left-handed four helix bundle with two overhand connections (the correct topology) and deviated by 4.9 Å r.m.s. from the NMR structure. To date, we have been unable to objectively distinguish the "correct" model structure from its topological enantiomer using profiling algorithms as well as a variety of methods developed by our groups and others. Perhaps this should not be surprising given the origin of the two families of models and the fact that the inter-residue contact maps of the "correct" model and its topological enantiomers are extremely similar. We believe that these models can provide a challenging test for groups interested in distinguishing a structure from its misfolded counterparts. For this reason, we have provided these structures to several groups to facilitate their efforts to develop and optimize new algorithms for threading and structure validation.
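For readers unfamiliar with the r.m.s. deviations quoted for these models, the following minimal sketch shows how such a number is computed from two sets of matched α-carbon coordinates. It assumes that the two structures have already been superimposed and that the residues are paired one-to-one; the optimal superposition step is omitted, and the function name and toy coordinates are invented for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation (same units as the input, e.g. Å)
    between two equal-length lists of (x, y, z) coordinates that are
    already superimposed and matched residue by residue."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy coordinates, not taken from any real structure: the two atoms
# are displaced by 3 Å and 4 Å respectively.
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
reference = [(0.0, 0.0, 3.0), (3.8, 4.0, 0.0)]
deviation = rmsd(model, reference)   # sqrt((9 + 16) / 2)
```

Because the deviations are squared before averaging, a few badly placed residues dominate the result, which is one reason a model with the correct topology can still differ from the experimental structure by several Å.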
This work highlights the need for new methods to identify the errors in current structure prediction methods. To this end, we have developed a new method for comparing protein structures, based on a minimal surface metric. A virtual polypeptide backbone is created by joining consecutive Cα atoms in a protein structure. The minimal surface between the virtual backbones of two proteins (the Area Functional) is determined numerically using an iterative triangulation strategy. The first protein is then rotated and translated in space until the smallest minimal surface is obtained. Such a technique yields the optimal structural superposition between two protein segments. It requires no initial sequence alignment, is relatively insensitive to insertions and deletions, and obviates the need to select a gap penalty. The optimal minimal area can then be converted to the Area-Cα distance, measured in Angstroms, to determine the structural similarity. This technique has been applied to a large class of proteins and is able to detect not only small-scale differences between closely related proteins but also large-scale topological similarities between evolutionarily unrelated proteins that lack any obvious sequence homology. To measure the similarity between structurally dissimilar proteins, an additional measure (the Fit Comparison) is developed. This is a scale-invariant measure of structural similarity that is useful for determining topological similarities between dissimilar proteins with unrelated sequences (A. Falicov).
MODEL BUILT STRUCTURES AND STRUCTURE BASED DRUG DESIGN
Structure based drug design traditionally has relied on the availability of high resolution x-ray crystallographic structures of the protein target of interest. Unfortunately, crystallographic or NMR spectroscopic analysis of particular new targets may prove difficult for a variety of experimental reasons. Given our interest in protein structure prediction and in the conformation of loops, and given the co-existence of a structure-based anti-parasitic drug design program that was stalled owing to particular difficulties with obtaining sufficiently pure malaria trophozoite cysteine protease and schistosome cercarial elastase for crystallographic studies, we constructed models of these two enzymes based on their homology to cysteine and serine proteases of known structure. This work, originally supported by a now defunct programmatic grant from the Advanced Research Projects Agency (ARPA), led to the identification of two small molecules: oxalic bis (2-hydroxy-1-naphthylmethylene)hydrazide, a malaria trophozoite cysteine protease inhibitor (IC50 value = 6 μM) that was active against the intact parasite (IC50 value = 7 μM); and 2-(4-methoxybenzoyl)-1-naphthoic acid, a cercarial elastase inhibitor (Ki = 3 μM) that inhibited parasite migration through skin. As our work on parasitic proteases had begun under this proposal and given the unexpected termination of our ARPA funding coupled with the need for the development of novel anti-malarials, we returned to this grant for support of the computational aspects of our anti-malarial lead optimization program.
In a recent article in Chemistry and Biology we described a pragmatic approach to structure-based inhibitor design for the trophozoite cysteine protease in malaria parasites. Analog design was based on the putative configurations of a ligand docked to a model of the three-dimensional structure of a malarial cysteine protease constructed by homology with papain and actinidin. In the absence of a crystal structure of the enzyme or enzyme-inhibitor complex, the ability to make and test quickly a wide variety of compounds was instrumental in our effort to build a structure-activity profile for inhibitor design. The hallmark of this method was simplified chemistry using commercially available starting materials.
Beginning with oxalic bis (2-hydroxy-1-naphthylmethylene)hydrazide, an inhibitor identified in a computational screen of the protease model against a database of small molecules, and following our design/synthesis strategy, we have obtained increasingly potent derivatives that block the ability of the parasites to infect and/or mature in red blood cells. The two best derivatives to date are 2-hydroxy-4-(4-nitrophenylmethylenoxy) benzoic (2-hydroxy-1-naphthylmethylene) hydrazide (IC50 = 450nM) and 2,4-dihydroxybenzoic (2-hydroxy-1-naphthylmethylene) hydrazide (IC50 = 150nM). These compounds represent a new class of anti-malarial chemotherapeutics that were identified from a computational search based on a model of the target trophozoite cysteine protease. Increasingly potent compounds have been identified through an intimate collaboration between computational and synthetic chemists in the absence of a detailed experimental structure of the target enzyme. These compounds approach the activity of chloroquine (IC50 = 20nM), but have a distinctly different mechanism of action. We have now shown that these compounds are active against chloroquine-resistant malaria. In very preliminary experiments, our colleagues at Georgetown have shown that 2,4-dihydroxybenzoic (2-hydroxy-1-naphthylmethylene) hydrazide is orally bioavailable, has a 3-4 hour circulating half life, and is not obviously toxic to rodents in acute and sub-acute settings (X. Chen, B. Gong).
We are applying this approach to several other drug design targets including Hepatitis A 3C proteinase (R. Dunbrack), a cysteine protease from T. cruzi (X. Chen), prostate specific antigen (P. Bamborough) and a dihydrofolate reductase from Cryptosporidium (P. Armand).
In an effort to understand the properties of malaria cysteine proteases from other species that cause human disease, we have studied the sequence and probable structure of the Plasmodium vivax protease recently cloned by our collaborator, Phil Rosenthal. We were able to show that while the cysteine proteases from various malaria species vary, the residues that line the subsite specificity pockets change little thereby preserving their hemoglobinase function in contrast to other cysteine proteases with similar degrees of sequence conservation but distinct functions (X. Chen).
PRION PROTEIN REPLICATION
Prions are a novel class of "infectious" pathogens distinct from viroids and viruses with respect to both their structure and the neurodegenerative diseases that they cause. Prion diseases are manifest as sporadic, inherited, and infectious disorders including scrapie, mink encephalopathy, chronic wasting disease, bovine spongiform encephalopathy, feline spongiform encephalopathy, and exotic ungulate encephalopathy of animals as well as kuru, Creutzfeldt-Jakob Disease (CJD), Gerstmann-Sträussler-Scheinker syndrome, and fatal familial insomnia of humans. The prion protein (PrP) is the major, if not the only, component of prions. PrP exists in two isoforms: the normal cellular form (PrPC) and the abnormal disease (scrapie)-related form (PrPSc).
Stan Prusiner and I began a collaboration to study the sequences of the prion proteins from various animals. This led to a study of the biophysical properties of the protein isoforms and peptides derived from the prion protein sequence. Analysis of these sequences by a variety of secondary structure prediction algorithms developed in our group and elsewhere offered an unusual result. While all of the algorithms agreed upon the location of the secondary structure elements, they disagreed as to whether a region would form an α-helix or a β-structure. This suggested that the normal form, PrPC, and the disease-causing form, PrPSc, might differ only in conformation. Extensive experiments to detect distinctions in the covalent structure continue to be unrevealing.
We knew that PrP 27-30, a proteolytically processed version of PrPSc that retained infectivity, was rich in β-structure and therefore hypothesized that the molecular etiology of disease could be a consequence of an α-helix to β-sheet structural transition. Peptides derived from the putative α-helical structural regions were synthesized and shown to form β-sheets and amyloid reminiscent of PrP 27-30. Spectroscopic studies of purified PrPC demonstrated that the normal cellular form was rich in α-helical structure and devoid of β-sheets while PrPSc, the disease-causing form, was enriched in β-structure. These results lent credence to the notion that a conformational change was at the heart of prion diseases.
In an effort to understand these conformational changes in more detail and guided by a variety of spectroscopic and genetic data, we have used de novo modeling techniques developed by our group to produce a plausible model of the three-dimensional structure of PrPC. A heuristic approach consisting of the prediction of secondary structures and of an evaluation of the packing of secondary elements was used to search for plausible tertiary structures. After a series of experimental and theoretical constraints were applied, four structural models of four-helix bundles emerged. A group of amino acids within the four predicted helices were identified as important for tertiary interactions between helices. These amino acids are predicted to be part of the hydrophobic core of the molecule and should be important for the maintenance of a stable tertiary structure of PrPC. Among four plausible structural models for PrPC, the X-bundle model seemed to correlate best with the known point mutations that occur in the putative helical regions, and segregate with the inherited prion diseases. These 5 (of 11) mutations cluster around a central hydrophobic core in the X-bundle structure. Furthermore, these mutations occur at or near those amino acids which are predicted to be important for helix-helix interactions. The three-dimensional structure of PrPC that we proposed now provides a basis for rationalizing mutations of the PrP gene in the inherited prion diseases and a guide for the design of genetically engineered PrP molecules for further experimental studies (P. Bamborough).
Fig. 3: From this information and a variety of experimental results from other groups, we developed a conformational model for prion replication. Panel A illustrates the postulated events in infectious and sporadic prion diseases. Wild-type PrPC is synthesized and degraded as part of normal cellular metabolism. Stochastic fluctuations in the structure of PrPC can create (k1) a rare, partially unfolded monomer (PrP*) that is an intermediate in the formation of PrPSc. PrP* can revert (k2) to PrPC, be degraded, or form a complex (k3) with PrPSc. Normally, the concentration of PrP* is low and PrPSc formation is insignificant. In infectious prion diseases, exogenous prions enter the cell and stimulate conversion (k5) of PrP* into PrPSc, which is likely to be an irreversible process. In sporadic prion diseases, where there are no exogenous prions, the concentration of PrPSc may eventually reach a threshold level upon which a positive feedback loop would stimulate the formation of PrPSc. Limited proteolysis of the amino terminus of PrPSc produces (k7) PrP 27-30, a truncated form of PrPSc that polymerizes into amyloid and has a high content of β-sheet. Denaturation (k9) of PrPSc or PrP 27-30 into D-PrP renders these molecules protease sensitive and abolishes scrapie infectivity; attempts to renature (k10) D-PrP have been largely unsuccessful. Panel B illustrates the postulated events in inherited prion diseases. Mutant (Δ) PrPC is synthesized and degraded as part of the normal cellular metabolism. Stochastic fluctuations in the structure of ΔPrPC are greater than those in wild-type PrPC; these fluctuations create (k1) significant amounts of a partially unfolded monomer (ΔPrP*) that is an intermediate in the formation of ΔPrPSc. ΔPrP* can revert (k2) to ΔPrPC, be degraded, or be converted (k5) into ΔPrPSc. Limited proteolysis of the amino terminus of ΔPrPSc produces (k1) ΔPrP 27-30, which in some cases may be less protease resistant than wild-type PrP 27-30.
Thus, PrPC can be thought of as a kinetically trapped intermediate in PrP folding where the activation barrier between PrPC and PrPSc prevents conversion except under conditions that lower the activation barrier.
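The qualitative behavior of this conformational model can be caricatured with a very small kinetic scheme. The sketch below is a toy only: it keeps just three steps of Fig. 3 (unfolding of PrPC to PrP* with rate k1, refolding with rate k2, and PrPSc-stimulated conversion of PrP* with rate k5), it omits synthesis, degradation, complex formation, proteolysis and denaturation, and every rate constant, concentration and time unit in it is an arbitrary assumption rather than a measured value.

```python
def simulate(seed_sc, k1=0.01, k2=1.0, k5=5.0, dt=0.01, steps=20000):
    """Forward-Euler integration of a toy three-state scheme:

        PrPC  <-> PrP*             (k1 forward, k2 reverse)
        PrP* + PrPSc -> 2 PrPSc    (k5, templated and irreversible)

    Returns the final (PrPC, PrP*, PrPSc) amounts in arbitrary units.
    All parameter values are hypothetical."""
    c, u, s = 1.0, 0.0, seed_sc
    for _ in range(steps):
        unfold = k1 * c        # PrPC -> PrP*
        refold = k2 * u        # PrP* -> PrPC
        convert = k5 * u * s   # needs pre-existing PrPSc as a template
        c += dt * (refold - unfold)
        u += dt * (unfold - refold - convert)
        s += dt * convert
    return c, u, s


# Without a seed no PrPSc can form in this toy; with a small seed the
# conversion feeds on itself.
unseeded = simulate(seed_sc=0.0)
seeded = simulate(seed_sc=0.01)
```

In this caricature PrPC persists indefinitely when no PrPSc is present, which corresponds to the kinetically trapped state described above, whereas a small amount of exogenous PrPSc is amplified because each conversion event creates more template.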