Prediction of Intrinsically Unstructured Proteins

 

 

IUPs

Intrinsically unstructured/disordered proteins (IUPs) have no single well-defined tertiary structure in their native state. Some proteins are fully disordered, while many others contain long disordered segments within an otherwise ordered, folded structure. The structure of some IUPs resembles the denatured states of ordered proteins, best described as an ensemble of rapidly interconverting alternative structures, while others have some tendency to form local secondary structure elements or show molten globule-like characteristics. Despite their lack of a well-defined globular structure, these proteins carry out basic functions, mostly associated with signal transduction, cell-cycle regulation and transcription. Often, their function is realized via molecular recognition, in which structural disorder confers specific advantages, such as increased speed of interaction and specificity without excessive binding strength. Disordered regions can also constitute flexible linkers or spacers that have a role in forming macromolecular assemblies. Some well-studied examples of IUPs include p21, the N-terminal domain of p53 and the transactivator domain of CREB. The importance of protein disorder is further underlined by its prevalence in various proteomes. In some eukaryotic genomes more than 20% of the coded residues are predicted to be disordered.

Most protein structures known so far have been elucidated by X-ray crystallography. However, this technique is not suitable for characterizing disordered regions of proteins, as the corresponding regions are missing from the electron density map due to their intrinsic flexibility. NMR spectroscopy is the most powerful method to obtain site-specific information on the structure and dynamics of IUPs. Other techniques, including CD and other spectroscopic methods, hydrodynamic measurements and proteolytic degradation, can also give important information about protein disorder. However, as each technique probes a different aspect of protein structure, none of them necessarily identifies disorder correctly. For example, loopy proteins, which have no repetitive secondary structure, would appear disordered by CD but ordered by the other techniques. With NMR, disorder is often concluded from poor signal dispersion, which does not distinguish between random coils and molten globules with a high potential to fold in the presence of a partner. In X-ray crystallography, crystal packing may force certain disordered regions to become ordered, and disordered binding segments are often crystallized in complex with their partner and are classified as ordered despite their lack of structure in isolation. In addition, wobbly domains would appear disordered, despite their intrinsic structural order. As a result, the available datasets are rather limited in size and are heterogeneous in terms of experimental conditions, techniques and interpretation of data. They also lack consistency due to the absence of clear conceptual and operational definition(s) of structural disorder. All of this results in false-positive and false-negative classifications, i.e. the inclusion of ordered segments in disorder databases, and the exclusion of disordered proteins/segments from them (and their inclusion in ordered reference databases). As a consequence, predictors trained on these datasets for assessing disorder reflect these uncertainties.
Our method, however, was parameterized using only globular proteins, and does not suffer from these uncertainties.

The basis of predicting protein disorder is the difference in sequence characteristics between folded and disordered proteins. Typically, IUPs exhibit a strong bias in their amino acid composition, and even a reduced amino acid alphabet is sufficient to recognize them at the level of the complete sequence. Other results indicate, however, that there are differences in sequence properties among different types of disordered proteins. Various factors have been suggested to be important for protein disorder, including flexibility, aromatic content, secondary structure preferences and various scales associated with hydrophobicity. Besides low mean hydrophobicity, high net charge was also suggested to contribute to disorder. All these different analyses, though, hint that the amino acid composition of IUPs results in their inability to fold, due to the depletion of typically buried amino acids and the enrichment of typically exposed amino acids. This implies that globular proteins have sequences with the potential to form a sufficiently large number of favorable interactions, whereas IUPs do not. In this method we put this inference on a quantitative footing by taking an energetics point of view. On this ground, the sequences encoding globular proteins and IUPs can be distinguished.

Short description of our method

Our prediction method for recognizing ordered and disordered regions in proteins is based on estimating the capacity of polypeptides to form stabilizing contacts. The underlying assumption is that globular proteins make a large number of interresidue interactions, providing the stabilizing energy to overcome the entropy loss during folding. In contrast, IUPs have special sequences that do not have the capacity to form sufficient interresidue interactions. Taking a set of globular proteins with known structure, we have developed a simple formalism that allows the estimation of the pairwise interaction energies of these proteins. It uses a quadratic expression in the amino acid composition, which takes into account that the contribution of an amino acid to order/disorder depends not only on its own chemical type, but also on its sequential environment, including its potential interaction partners. When this calculation is applied to IUP sequences, their estimated energies are clearly shifted towards less favorable values compared to globular proteins, enabling the prediction of protein disorder on this ground.
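The quadratic expression described above can be illustrated with a minimal Python sketch. It assumes the estimated energy per residue has the form sum over i, j of n_i * P_ij * n_j, where n is the amino acid composition of the sequence. The matrix used here (`TOY_MATRIX`) is a made-up placeholder, not the published energy predictor matrix, and the function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence):
    """Return the fraction of each of the 20 standard amino acids in `sequence`."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]

def estimated_energy_per_residue(sequence, matrix):
    """Quadratic form sum_ij n_i * P_ij * n_j in the amino acid composition n."""
    n = composition(sequence)
    return sum(n[i] * matrix[i][j] * n[j] for i in range(20) for j in range(20))

# Placeholder 20 x 20 matrix (an assumption, NOT the published parameters):
# -1 between two hydrophobic residue types, 0 for every other pair.
HYDROPHOBIC = set("AVILMFWC")
TOY_MATRIX = [
    [-1.0 if a in HYDROPHOBIC and b in HYDROPHOBIC else 0.0 for b in AMINO_ACIDS]
    for a in AMINO_ACIDS
]

e_hydrophobic = estimated_energy_per_residue("AVILMFWCAVIL", TOY_MATRIX)
e_polar = estimated_energy_per_residue("SGEKDSPEQKRS", TOY_MATRIX)
```

With this toy matrix, the purely hydrophobic sequence gets the most favorable (most negative) estimate and the purely polar one gets zero, which mimics the direction of the shift described in the text. The real separation depends on the fitted matrix.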

The calculation involves a 20 by 20 energy predictor matrix. The parameters of this matrix were derived using only globular proteins with known structures. For these structures, the energy can be calculated using a coarse-grained approach, as a sum of pairwise interactions between amino acid pairs within a distance cutoff. The applied interaction matrix was calculated from the observed frequencies of amino acid pairs using the approach of Thomas and Dill (PNAS 1996, 93, 11628-11633). We aimed to estimate the pairwise energy without a presumed structure, from the amino acid composition only. The parameters of the energy predictor matrix were determined by least squares fitting, so as to obtain the best overall fit between the energy calculated from the known structures and the energy estimated from the amino acid compositions for all proteins in the dataset.
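The least squares step can be sketched as follows. This is one plausible formulation, not necessarily the exact procedure used: it assumes that for every protein k the structure-derived energy per residue of each amino acid type i (`energies_by_type[k][i]`) is available, and fits each row of the matrix so that it is approximated by the sum over j of P_ij * n_j for that protein's composition n. The data are synthetic and use a 3-letter alphabet to keep the example small. All names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: compositions do not span the alphabet")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, size):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, size + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * size
    for row in reversed(range(size)):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, size))
        x[row] = (aug[row][size] - tail) / aug[row][row]
    return x

def fit_predictor_matrix(compositions, energies_by_type):
    """Least-squares fit of P so that energies_by_type[k][i] ~ sum_j P[i][j] * compositions[k][j]."""
    m = len(compositions[0])
    # Normal equations for row i: (sum_k n^k n^k^T) P_i = sum_k e_i^k n^k
    gram = [[sum(n[r] * n[c] for n in compositions) for c in range(m)] for r in range(m)]
    matrix = []
    for i in range(m):
        rhs = [sum(e[i] * n[r] for n, e in zip(compositions, energies_by_type))
               for r in range(m)]
        matrix.append(solve_linear_system(gram, rhs))
    return matrix

# Synthetic, noise-free data generated from a known matrix (illustrative values only).
true_p = [[-2.0, -0.5, 0.3], [-0.5, -1.0, 0.1], [0.3, 0.1, 0.4]]
compositions = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7],
                [0.4, 0.4, 0.2], [0.3, 0.1, 0.6]]
energies = [[sum(true_p[i][j] * n[j] for j in range(3)) for i in range(3)]
            for n in compositions]
fitted = fit_predictor_matrix(compositions, energies)
```

Because the synthetic energies contain no noise, `fitted` reproduces `true_p` up to rounding. With real structure-derived energies the fit would only be approximate, and its quality is what the correlation between estimated and calculated energies measures.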

This novel approach is validated by the good correlation of this estimated energy with the values calculated for known structures. When the approach was applied to disordered sequences, their predicted energy values were clearly shifted towards less favorable energies compared to globular proteins. This indicates that experimentally characterized disordered proteins have special amino acid compositions, which, independently of the actual sequence, do not allow the formation of the favorable contacts expected for folded proteins. Thus, these proteins are indeed intrinsically unstructured.

Based on the significant separation between the estimated pairwise energies of globular and experimentally verified intrinsically unstructured proteins, this approach can be turned into a method to predict protein disorder. Since many proteins are neither fully ordered nor fully disordered, we considered only the local sequential neighborhood for this purpose. The original energy predictor matrix P, derived at the level of global sequences, was recalculated by treating each position separately, and taking into account only its predefined neighborhood in sequence. The energy and amino acid composition for each position were calculated by considering only interaction partners 2 to 100 residues apart. The choice of this range represents a trade-off between covering most structured domains and separating distinct domains in multi-domain proteins. This procedure yields an estimated energy at position p of type i:

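$$ e_i^{p} = \sum_{j=1}^{20} P^{p}_{ij} \, c_j^{p} $$

where P^p is the position-specific energy predictor matrix and c_j^p is the fraction of amino acid type j among the residues 2 to 100 positions away from p. The position-specific energy estimates were averaged over a window of 21 residues.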
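The position-specific calculation and the window averaging can be sketched as below. This is a minimal illustration under stated assumptions: the energy at position p is taken as the sum over j of P_ij * c_j, with i the residue type at p and c the composition of the residues 2 to 100 positions away; the matrix is a made-up placeholder rather than the published parameters; the input contains only the 20 standard amino acids; and the averaging window is simply truncated at the sequence termini. Function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def position_energies(sequence, matrix, min_sep=2, max_sep=100):
    """Estimated energy at each position from the composition of its sequence neighborhood."""
    length = len(sequence)
    energies = []
    for p, residue in enumerate(sequence):
        # Partners lie min_sep..max_sep residues away on either side of p.
        left = range(max(0, p - max_sep), p - min_sep + 1)
        right = range(p + min_sep, min(length, p + max_sep + 1))
        partners = [sequence[q] for q in left] + [sequence[q] for q in right]
        if not partners:
            energies.append(0.0)  # assumption: no partners means no estimated contacts
            continue
        row = matrix[INDEX[residue]]
        # Averaging row entries over the partners equals sum_j P_ij * c_j.
        energies.append(sum(row[INDEX[aa]] for aa in partners) / len(partners))
    return energies

def smooth(values, window=21):
    """Average each value over a centered window, truncated at the sequence ends."""
    half = window // 2
    smoothed = []
    for p in range(len(values)):
        segment = values[max(0, p - half): p + half + 1]
        smoothed.append(sum(segment) / len(segment))
    return smoothed

# Placeholder matrix (an assumption): -1 between hydrophobic types, 0 otherwise.
HYDROPHOBIC = set("AVILMFWC")
TOY_MATRIX = [
    [-1.0 if a in HYDROPHOBIC and b in HYDROPHOBIC else 0.0 for b in AMINO_ACIDS]
    for a in AMINO_ACIDS
]

sequence = "A" * 30 + "S" * 30
raw = position_energies(sequence, TOY_MATRIX)
profile = smooth(raw)
```

For this artificial sequence the smoothed profile is negative over the alanine-rich half and zero at the end of the serine-rich half, showing how a local energy profile can separate segments of one chain.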

 
 


References:

The Pairwise Energy Content Estimated from Amino Acid Composition Discriminates between Folded and Intrinsically Unstructured Proteins
Zsuzsanna Dosztányi, Veronika Csizmók, Péter Tompa and István Simon
J. Mol. Biol. (2005) 347, 827-839.

IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content
Zsuzsanna Dosztányi, Veronika Csizmók, Péter Tompa and István Simon
Bioinformatics (2005) 21, 3433-3434.


 
 
Zsuzsanna Dosztanyi | Peter Tompa | Istvan Simon | Institute of Enzymology