On Stable States in a Topologically Driven Protein Folding Model

Zheng Dai¹, David Becerra¹, and Jérôme Waldispühl¹

¹ School of Computer Science, McGill University, Montreal, Canada, [email protected], [email protected]

Abstract. Theoretical models of protein folding often make simplifying assumptions that allow analysis, yielding interesting theoretical results. In this paper, we study models where folding dynamics is primarily driven by local topological features in an iterative manner. We illustrate the merit of the proposed approach through its ability to simulate realistic protein folding processes even when the sequence content information is reduced to just hydrophobic and polar. We then analyze our models and show that under our simple assumptions certain structures are inherently unstable, and that determining whether structures can be stable is an NP-hard problem. Interestingly, we find that when our model has only two amino acids, the problem becomes solvable in polynomial time.

Keywords: Protein folding, inverse folding, complexity, minimalist model, HP model, off-lattice

1 Introduction

The development and analysis of theoretical models simulating the folding of proteins is an effective approach to study a phenomenon that cannot be directly observed. Despite more than 40 years of intensive research, our understanding of the rules shaping the protein folding landscape remains incomplete, mostly because of the sheer size of the systems and inherent computational complexity of the underlying problems.

Computer simulations of atomistic models are arguably among the most popular approaches to study biomolecular systems [Karplus and McCammon, 2002]. While those methods are constantly developed and enhanced, they are facing both technological limitations and complexity factors that restrain their progress and analysis.
Particularly, these computational approaches are hampered by the inherent complexity involved in modelling the physical systems, which (i) limits applications of this technology to the simulation of small molecules [Lane et al., 2013], and more importantly (ii) prevents any formal analysis of the properties of the folding landscape generated by these models.

By contrast, minimalist models offer a different perspective to study protein folding mechanisms [Kolinski and Skolnick, 2004, Head-Gordon and Brown, 2003]. The simplification of amino acid properties, reduction of the complexity of interactions, and discretization of the energy landscape allow the conception of mathematically simple models whose behaviour can be analytically calculated. Such insights provide important clues on fundamental principles governing the folding of proteins. One of the earliest examples is the Zimm-Bragg model, which was designed to analyze helix-coil transitions [Zimm and Bragg, 1959]. More sophisticated frameworks such as the hydrophobic zipper model [Dill et al., 1993] have subsequently been developed to study cooperativity in protein folding. Among them, the hydrophobic/polar (HP) lattice model, which limits amino acid interactions to hydrophobic contacts and restricts the conformational space to a lattice [Dill, 1985, Istrail and Lam, 2009], probably remains the most popular one.

HP models have been used to show that the protein folding problem is NP-hard on the 3D cubic lattice [Berger and Leighton, 1998] and even on the 2D square lattice [Crescenzi et al., 1998]. Interestingly, HP lattice models have also been used to show that the complexity of the inverse folding problem (i.e. deciding whether there exists a sequence that folds into a pre-determined target structure) can differ from that of the folding problem [Yue and Dill, 1992].
Indeed, while the inverse folding problem is polynomial on 2D lattices, it turns out to be NP-complete on 3D lattices [Berman et al., 2007]. Variants of the HP lattice models allowing the creation of disulfide bridges between cysteines have also been introduced, and efficient algorithms to solve the inverse folding problem have been designed for 2D [Khodabakhshi et al., 2009b] and 3D lattices [Khodabakhshi et al., 2009a].

However, the simplicity of the HP models limits the impact of the theoretical results derived from them on our understanding of the biological principles driving the folding of proteins. First, lattices strongly constrain the accessible conformational landscape [Guyeux et al., 2014]. Off-lattice HP models were introduced to circumvent this limitation [Istrail and Lam, 2009], but little is known about the computational complexity of the folding and inverse folding problems in these models [Pelta and Carrascal, 2007]. Next, the restriction to the HP alphabet does not capture the impact of the diversity of amino acid properties. Finally, the dynamics of the folding is not considered either. Simple protein folding models with better expressivity, but whose behaviour can still be characterized, are thus needed to fill this gap and bring us closer to the biological reality.

It is generally accepted that the folding of proteins is an iterative process, in which the protein moves from one configuration to the next until a stable configuration (e.g. minimum energy structure) is reached [Bryngelson et al., 1995]. Although minimalist models and variants of the HP model have been introduced to simulate aspects of this dynamical process [Hockenmaier et al., 2007, Dill et al., 1993], they did not lead to complexity results.
Given the impracticality of sampling the complete conformational space of proteins, a popular approach consists of defining instead a set of rules that explicitly determine how an amino acid must move at each moment in time, taking into account the positions of neighbouring amino acids [Krasnogor et al., 2005]. These methods have mainly used cellular automata and Lindenmayer systems to define the rules and dynamics of the folding process. The complexity of the proposed models, given by the enormous number of transition rules, hampered the success, applicability, and convergence of these methods [Reyes et al., 2014].

In this paper, we lay the groundwork for a system on which we can construct models that simulate the protein folding process. Underlying our system is the hypothesis that protein folding is driven primarily through perturbations resulting from local graph topology. Remarkably, in such systems the concept of stable structures emerges naturally as a fixed point in the conformational landscape. We have applied the proposed system to describe a computationally tractable set of models based on the HP alphabet and assessed their performance on real proteins. Our results demonstrate that although our models remain sufficiently simple to permit a formal analysis of their combinatorial properties, they can also achieve satisfying predictive capabilities. This aspect is important to guarantee the biological relevance of our complexity results.

We then further investigate some of the theoretical implications of our system. Specifically, we demonstrate that determining whether specific structures can be stable is an NP-hard problem. To our knowledge, this is the first result to suggest that, in an off-lattice model, the difficulty of the inverse folding problem scales with the size of the protein rather than with the energy model. Interestingly, we also find that if we restrict our alphabet to two amino acids (e.g. the HP model), the stable structure decision problem becomes solvable in polynomial time. These results suggest that identifying "designable" protein structures is difficult in practice.

2 Methodology

2.1 The Role of Local Topology in Folding Dynamics

A protein is a chain of amino acids held together through peptide bonds. This chain usually folds into a particular conformation, which is held together by interactions between non-adjacent amino acids. Protein structures can therefore be easily described by a graph, where each node represents an amino acid, and each edge represents an interaction. This representation of protein structures provides us with a convenient framework to study the folding process [Vendruscolo et al., 1999].

In order to accurately represent the dynamic process that transforms an unfolded polypeptide into a stable fold, we need additional information. Since peptide bonds are more stable than most other interactions, we need a way of differentiating edges that represent peptide bonds from those that represent other interactions. Here, we will refer to edges representing peptide bonds as permanent, and we will refer to the other edges as transient. For convenience, we will also say each node has a permanent edge pointing to itself.

In this paper, our key assumption about folding dynamics is that protein folding is driven primarily through perturbations resulting from local topology. We argue that this principle models an important aspect of the dynamics of protein folding. Here, we also assume that folding occurs in discrete time steps, and that folding is deterministic. At a given time step, for any pair of amino acids, there may exist an edge between them. If the edge is permanent, then the edge will exist in the next time step, and for all future time steps. However, if the edge is transient, or if there is no edge between them, then the existence of the edge in the next time step depends on the neighbourhood of that pair and on the identities of the amino acids.

The neighbourhood of a contact between two amino acids is easy to describe in the context of contact maps, which are equivalent to adjacency matrices in the graph terminology we use here. The order of the rows and columns of the matrix corresponds to the primary sequence of the protein. An entry of this matrix is equal to 1 if an edge exists (i.e. a contact exists) and 0 otherwise. Each pair (i, j) of amino acids is an entry in the adjacency matrix. In this paper, we define the neighbourhood N(i, j) of that pair (i, j) to be the 3 by 3 sub-matrix with that entry in the middle, i.e. the set of contacts N(i, j) = {(i − 1, j − 1), (i − 1, j), (i − 1, j + 1), (i, j − 1), (i, j), (i, j + 1), (i + 1, j − 1), (i + 1, j), (i + 1, j + 1)}. For any pair that lies on the edge of the adjacency matrix, we will consider the undefined entries in the neighbourhood to be 0.

Using this definition, we can see how local features can propagate through the protein. In figure 1 we provide a toy model that illustrates how the model works, and how features may arise. Our definition of a neighbourhood is further motivated by its strong parallels with statistical potential models that also make use of local structures.

Fig. 1: A toy model with two amino acids "A" and "B". Here the transition function only considers two cells in the matrix, (1,1) and (1,2), and it converges to a cycle of length 2 on the sequence "AAAABBBB".

We can now define a transition function that takes as input a pair of amino acids (i, j) with its neighbourhood, and outputs the status of the same edge (i, j) at the next time step. More precisely, the transition function outputs a 1 or a 0, where 1 means an edge should exist in the next time step, and 0 means an edge should not exist in the next time step.
Finally, we can define an overall step function, which takes a graph as input and computes the transition function on each pair of vertices that does not have a permanent edge. Importantly, the original graph is left unchanged throughout this computation. The individual results of the transition function are instead cached in a new graph, which replaces the original one at the end of the computation of the overall step function. This new graph contains all permanent edges the original graph had, and has a transient edge for each pair of vertices that fulfills the following 3 requirements: (i) the pair does not have a permanent edge between them in the original graph; (ii) the transition function outputs a 1 for the pair; and (iii) if an edge did not exist between them in the original graph, then the degree of neither vertex exceeds a constant we will call the interaction limit. The last restriction is needed because nothing else in our formulation of protein folding stops a protein from transitioning into states that are physically infeasible.

Algorithm 1 illustrates the process described above. Note that we initialize the contact map on line 2, and refer to an arbitrary convergence criterion on line 13. These steps are left undefined in this generic algorithm. However, in our experiments we define the initial state as the matrix consisting of 0 everywhere except for the main diagonal and sub-diagonal, which can be interpreted as the graph containing only permanent edges. We also use a convergence criterion that halts the process after a specified number of steps.
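As a rough illustration, the step function described above can be sketched in Python. The names `neighbourhood`, `initial_state`, `step`, and `fold`, the toy transition rule, and the convention that a vertex's degree counts its edges in the original graph with the self-loop excluded are assumptions made for this sketch; the text does not fix them.

```python
def neighbourhood(A, i, j):
    """3 x 3 sub-matrix of A centred on (i, j); out-of-range cells count as 0."""
    n = len(A)
    return tuple(
        tuple(A[r][c] if 0 <= r < n and 0 <= c < n else 0
              for c in (j - 1, j, j + 1))
        for r in (i - 1, i, i + 1)
    )


def initial_state(n):
    """Contact map containing permanent edges only (|i - j| <= 1)."""
    return [[1 if abs(i - j) <= 1 else 0 for j in range(n)] for i in range(n)]


def step(A, seq, transition, interaction_limit):
    """One synchronous update: every cell of B is computed from the old map A."""
    n = len(A)
    # Assumption: degree = number of edges in the old graph, self-loop excluded.
    degree = [sum(A[i]) - A[i][i] for i in range(n)]
    B = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            if j - i <= 1:
                value = 1  # permanent edge
            else:
                value = transition(neighbourhood(A, i, j), seq[i], seq[j])
                is_new = value == 1 and A[i][j] == 0
                if is_new and max(degree[i], degree[j]) > interaction_limit:
                    value = 0  # requirement (iii): no new edge past the limit
            B[i][j] = B[j][i] = value
    return B


def fold(seq, transition, interaction_limit, max_steps):
    """Iterate the step function for a fixed number of steps."""
    A = initial_state(len(seq))
    for _ in range(max_steps):
        A = step(A, seq, transition, interaction_limit)
    return A


# Hypothetical toy rule: a contact exists exactly when both residues are 'H'.
def toy_transition(U, a, b):
    return 1 if a == "H" and b == "H" else 0


final = fold("HPPH", toy_transition, interaction_limit=8, max_steps=3)
```

With this toy rule the only transient contact that forms is between the two hydrophobic residues at positions 0 and 3, and applying `step` once more leaves the map unchanged.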
Algorithm 1 Protein Folding Process
 1: Take an amino acid sequence S of length n as input
 2: Initialize an n × n matrix A to an initial state
 3: loop
 4:     Initialize a blank n × n matrix B
 5:     for each cell (i, j) in B where j ≥ i do
 6:         if |i − j| ≤ 1 then
 7:             Set B_{i,j} to 1, since it is a permanent edge
 8:         else
 9:             Construct a 3 × 3 matrix U from cell A_{i,j} and the 8 cells surrounding it
10:             If any cells land outside the matrix, treat their value as 0
11:             Feed U, S_i, and S_j to the transition function and set B_{i,j} to its output
12:     Set A to B
13:     if some convergence criterion is met then
14:         return A

2.2 HP-X8 Models

Our description above requires parametrization of the transition function before it can be used to iterate on graphs to simulate folding dynamics. We first need to set the number of different amino acids the transition function is capable of recognizing, and what the transition function outputs for every unique set of inputs. We will refer to such a parametrization of a transition function as a model.

In order to verify the validity of our overall approach, we need to find models that have the ability to simulate biological phenomena. Searching for such a model in the full space of possible models is obviously intractable, so instead we introduce a subset of the model space which is both tractable and contains models that are complex enough to capture biological phenomena.

First, we reduce the amino acid alphabet from 20 letters down to 2 (Hydrophobic and Polar), and limit the number of interactions to 8. Next, in the transition function, we will only consider two types of amino acid inputs: either both amino acids are hydrophobic, or not. In particular, this means that for any matrix M, the transition function will give the same output on inputs (M,P,P) and (M,H,P), where H and P are the hydrophobic and polar amino acids respectively.
Next, for the matrix input, the transition function for a pair of amino acids (i, j) will only consider the entries at the coordinates (i − 1, j − 1), (i − 1, j + 1), (i, j), (i + 1, j − 1), and (i + 1, j + 1), which together form an X shape. We also consider two matrices to be equivalent if they are identical after a series of 90 degree rotations, which means the transition function ignores the relative linear positions of the amino acids on the chain. Under these hypotheses, our model can be defined using the 12 types of matrix inputs shown in figure 2. Finally, we will say that (H,H) interactions are stronger than all other interactions. This means that if the transition function outputs 0 for a matrix and an (H,H) input, it must output 0 for any other amino acid inputs. We call this set of models HP-X8 models.

Due to these restrictions, there are only 531441 models with different parameters: each of the 12 matrix inputs admits 3 allowed pairs of outputs for the (H,H) and non-(H,H) cases, which gives 3^12 = 531441. This reduction in complexity allows brute force enumeration and testing of these models; however, a significant amount of information is lost. Although it is unlikely for HP-X8 models to be able to make detailed predictions of protein dynamics, we find that some of these models do capture interesting phenomena, and their ability to do so despite the loss of information is indicative of the expressivity and biological relevance of our overall approach.

Fig. 2: All unique input matrices of HP-X8 models.

3 Results and Discussion

3.1 Brute Force Enumeration

We tested all HP-X8 models on random sequences, and we considered those that did not halt within 5 time steps to be viable, since heuristically speaking 5 time steps would not be enough to generate interesting structures. For our experiments, we always assumed that folding starts with no transient edges. This filtering process left us with 55404 viable models.
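The counts of 12 matrix inputs and 531441 HP-X8 models quoted in Section 2.2 can be checked by direct enumeration. The Python sketch below assumes one particular encoding of the X-shaped input (the four corner cells listed clockwise, followed by the centre cell); the encoding and the function names are illustrative choices, not taken from the text.

```python
from itertools import product


def rotate(x):
    """Rotate the X-shaped input by 90 degrees: corners cycle, centre stays."""
    nw, ne, se, sw, centre = x
    return (sw, nw, ne, se, centre)


def canonical(x):
    """Smallest of the four rotations, used as the class representative."""
    forms = [x]
    for _ in range(3):
        forms.append(rotate(forms[-1]))
    return min(forms)


# Distinct X-shaped inputs up to 90 degree rotations.
inputs = {canonical(x) for x in product((0, 1), repeat=5)}
n_inputs = len(inputs)

# Allowed (output for (H,H), output otherwise) pairs: if the (H,H) output is 0,
# the output for every other amino acid pair must be 0 as well.
allowed = [(hh, other) for hh, other in product((0, 1), repeat=2)
           if hh == 1 or other == 0]

n_models = len(allowed) ** n_inputs
print(n_inputs, len(allowed), n_models)
```

The enumeration yields 12 inequivalent inputs and 3 allowed output pairs per input, hence 3^12 = 531441 models.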
In order to test the viable models, we extracted a diverse set of 95 proteins from the BetaSheet916 dataset [Cheng and Baldi, 2005], with sizes ranging from 41 to 120 residues, for which the native 3D structures are available with a resolution better than 2.5 Å. The extracted proteins belong to all of the main protein families (such as α/β, α+β and all-β) as defined by SCOP. The sequence identity between any pair of sequences in the data set is between 15 and 20%, and only proteins containing fewer than 6 strands were extracted. BetaSheet916 is a benchmark set that has been routinely adopted for β-residue pair, β-strand pair, and β-sheet prediction methods. Additionally, the proteins in BetaSheet916 have played a central role in protein folding studies by being the system of choice in a vast body of experimental and theoretical studies. The table for reduction of the 20 amino acids to the HP alphabet is provided in figure 3.

Fig. 3: Translation table for the 20 amino acid alphabet to the HP alphabet.

For each protein we ran each model for 100 time steps and computed the harmonic mean of the precision and recall (F1 score). For the purposes of scoring how well a model performs, we define an interaction as existing if a pair of amino acids are closer than 10 nm apart, and we say we have a true positive if a predicted interaction lies within 2 amino acids of the real contact, since those parameters appear to produce the best results. Due to the abundance of short range contacts, models that generate large amounts of those indiscriminately usually end up with the best F1 score. However, it does appear to be a decent heuristic, as some models that score reasonably well do display interesting features. Therefore, we took the 50 models that scored best with the F1 score and ran each model for 200 time steps, this time applying a complementary scoring function:

\[
F_1 \times \Bigl( \sum_{d \in \{5, 6, 7, \ldots\}} \sqrt{c_d \sqrt{d - 4}} + \sqrt{c_4} + \sqrt{c_3} \Bigr) \tag{1}
\]
where c_d is the number of contacts detected that are d amino acids apart, and F_1 is the F1 score. This function takes into account that it is better to have a diverse range of contacts at all distances by using a sum of concave functions, and includes a range dependent factor so that long range contacts are more heavily favoured than close range ones.

We also performed the whole process for a control set, which consisted of the same proteins, but with the sequence of the amino acids shuffled before starting the process. Figure 4 contains some examples of the resulting contact maps, and a comparison to the controls. (A comprehensive list of the results can be found in the supplementary material.)

Figure 4 illustrates an interesting observation. We discover that for many sequences there exists an HP-X8 model that takes a protein from a state with no transient edges to one whose adjacency matrix reasonably resembles the real protein contact map. These results support our hypothesis by showing that local perturbations in topology are indeed capable of generating protein folds that look realistic. By comparing our results to our controls, we observe that without sequence information, HP-X8 models are less capable of carrying a protein into a state that resembles the real protein contact map, which rules out the possibility that our results are a consequence of the robustness of the HP-X8 models. We have included the data pertaining to this comparison in the supplementary materials.

Fig. 4: On the top row are optimal HP-X8 models iterated on 1B13, 1SOY, and 1NR2 for 190, 197, and 155 time steps respectively from left to right. Below them are the results of the same experiment on the same proteins, where the sequence was randomized. The lower triangle is the contact map extracted from the PDB, while the upper triangle is the output of the model.

3.2 Consensus Model

In the previous section, we have provided evidence that our approach has merit.
Nonetheless, if our models are reflective of reality, it is reasonable to suspect that all the models we have captured in the last experiment are noisy approximations of a single model. Therefore, we would like to assess the degree to which the models we have captured approximate reality. To this end we measure the agreement between the transition functions of the captured models.

We allow each protein to place a vote for what should happen for each input of the transition function, and we have recorded the results of the vote in figure 5a. Under the null hypothesis that each protein selects its best model uniformly at random, we observe statistically significant (p-value under 0.05) agreement for all inputs except 3. Similarly, we also checked model agreement for our control set in the same way. Figure 5b shows our results. The overall agreement levels seem to be approximately the same as for our result set.

Unsurprisingly, these results suggest that the HP-X8 models are far too simplistic to fully capture a general underlying model. This is quite plausible, since models based on the HP alphabet lose a large amount of information when transitioning from a 20 letter alphabet to a 2 letter alphabet. Despite this, it is clear that our framework captures a signal, since there are statistically significant agreements in most of the rules. We hypothesize that we are capturing the rules that are required for a transition function to generate biological structures. Our control experiment supports this notion, since the aggregate function from the control experiment is very similar to the aggregate function from our main experiment, with similar levels of agreement.

Fig. 5: Consensus results. (a) The agreements between the best models of each protein. (b) The agreements between the best models of each protein for the control experiment.

3.3 Stable States

We now present the main theoretical results of this study.
Importantly, in the remainder of this manuscript we will consider the full generalized models instead of the subset of HP-X8 models we have previously used to demonstrate the biological relevance of this system. Our study aims to characterize the difficulty for a sequence to have a stable conformation (i.e. state) in this framework.

Interestingly, our models enable us to naturally define the concept of a stable state. Let a state s_t be a graph (or its adjacency matrix) representing a protein structure at a given time step t. Using our models, a state gives all the information required to compute the next state s_{t+1}. Thus, we can decide that folding is complete if the next state s_{t+1} given by the step function is identical to the current state s_t. The state s_t is then called a stable state (or stable structure).

Intuitively, the stability of a state is dependent on the sequence. This observation leads naturally to several interesting questions such as: given a transition function, are there certain states that are inherently unstable, and if yes, is it possible to find them? Such questions are closely related to the problem of protein design, since a state that is inherently unstable is also undesignable. To answer these questions, let us consider a problem which we will call the stable state characterization problem.

Definition 1. Given a graph with n vertices representing a state and a set of k amino acids along with a transition function, determine if there exists a sequence of n amino acids taken from the set such that the state is a stable state.

Theorem 1. Stable state characterization is NP-hard.

Proof. We give a polynomial reduction from vertex colouring, a classical NP-complete problem, to stable state characterization. It suffices to introduce amino acids which correspond to colours, and make edges unstable when two amino acids representing the same colour interact with each other, and make edges stable when the amino acids are different.
As a construction detail, we make sure that edges do not spontaneously form. A string of amino acids that stabilizes the state must then encode a graph colouring. Note that the interaction limit does not come into play here, since it only restricts the formation of new edges, and not whether an edge can break.

As a further construction detail, we need to interlace the original graph with non-interacting amino acids. We can take the adjacency matrix representing the original problem, and interlace blank rows and columns between each row and column of the original adjacency matrix. This is required because adjacent amino acids are connected by default, which would limit the graph topologies that could be encoded if we did not interlace.

If a sequence that stabilizes the state exists, then we can read the colouring off the parts of the sequence that correspond to the original matrix. The colouring would be valid, since otherwise one of the edges would be unstable. Conversely, if a valid colouring exists, then we can make a sequence of amino acids corresponding to that colouring and interlace that sequence with any amino acid, since the corresponding rows and columns are blank and therefore stable. This provides a sequence that stabilizes the new adjacency matrix, since the rows and columns of the original adjacency matrix are stabilized by the colouring by our construction. Thus, a solution to the stable state characterization problem exists if and only if a solution to the original vertex colouring problem exists.

If the original graph is also represented as an adjacency matrix, then the new matrix can be created in linear time. Thus, there is a polynomial reduction from vertex colouring to stable state characterization, and therefore stable state characterization is NP-hard. □

Corollary 1. Stable state characterization is NP-complete.

Proof.
This follows immediately from the above theorem, since a sequence of amino acids that allows the state to be stable is a suitable certificate. Evaluating such a certificate takes time O(n²), where n is the length of the sequence. □

Thus, in the general setting it is impossible to provide an efficient method of deciding whether a particular structure can be stable or not, unless P = NP. Remarkably, we note that this also proves that there can be states that are inherently unstable. Since a vertex colouring problem can encode 3-SAT at 3 colours, stable state characterization can be NP-complete if the transition function uses 3 amino acids or more.

But perhaps the problem only becomes NP-complete when the states are unrealistic, for instance when a single amino acid has an unrealistic number of interactions. The following theorem shows that this is not the case.

Theorem 2. Vertex colouring with k colours where the maximum degree is 2k − 1 is NP-complete, given that k > 2.

Proof. We can reduce to this problem from arbitrary k-colouring. Note that if a vertex has more than 2k − 1 incoming edges, we can construct a device where two unconnected nodes are both connected to the same k − 1 other nodes. This would then require the two nodes to be the same colour. We can keep doing this to create a chain, where each link must be the same colour. This way, each link only requires one more incoming edge, thus the maximum degree of this graph is 2(k − 1) + 1 = 2k − 1. Since each node originally can have at most as many incoming edges as there are nodes, this modification represents at most a polynomial increase in the number of nodes, thus the reduction is polynomial. □

From the way our proof of NP-completeness is constructed, it follows from the theorem that restricting to realistic instances of the stable state characterization problem does not necessarily simplify the problem.
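The interlacing construction used in the proof of Theorem 1 can be illustrated on a small instance. The Python sketch below is only an illustration under stated assumptions: the function names, the single-letter amino acid labels, the filler letter "x", and the exact form of `colour_transition` (an existing edge survives only between two different amino acids, and no edge ever forms spontaneously) are choices made here, not the authors' implementation.

```python
from itertools import product


def neighbourhood(A, i, j):
    """3 x 3 sub-matrix of A centred on (i, j); out-of-range cells count as 0."""
    n = len(A)
    return tuple(
        tuple(A[r][c] if 0 <= r < n and 0 <= c < n else 0
              for c in (j - 1, j, j + 1))
        for r in (i - 1, i, i + 1)
    )


def interlace(adj):
    """Place vertex v of the input graph at chain position 2v.

    The odd positions are filler residues whose only contacts are the
    permanent ones (|i - j| <= 1).
    """
    n = len(adj)
    m = 2 * n - 1
    A = [[1 if abs(i - j) <= 1 else 0 for j in range(m)] for i in range(m)]
    for u in range(n):
        for v in range(n):
            if u != v and adj[u][v]:
                A[2 * u][2 * v] = 1
    return A


def colour_transition(U, a, b):
    """Assumed rule: keep an existing edge only between different amino acids."""
    return 1 if U[1][1] == 1 and a != b else 0


def is_stable(A, seq, transition):
    """A state is stable if every non-permanent cell reproduces itself."""
    n = len(A)
    return all(
        transition(neighbourhood(A, i, j), seq[i], seq[j]) == A[i][j]
        for i in range(n) for j in range(i + 2, n)
    )


# A triangle needs 3 colours, so no sequence over 2 amino acids stabilizes it.
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
A = interlace(triangle)
proper = is_stable(A, "RxGxB", colour_transition)   # valid 3-colouring
clash = is_stable(A, "RxRxB", colour_transition)    # vertices 0 and 1 collide
two_letters = any(is_stable(A, s, colour_transition)
                  for s in product("RG", repeat=5))
```

Reading off the even positions of a stabilizing sequence gives a proper colouring of the triangle, while exhaustive search over a two-letter alphabet finds no stabilizing sequence, as the reduction predicts.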
3.4 Implications for the Inverse Folding Problem

The results presented in the last section have implications for the inverse folding problem, where given a structure one has to produce a sequence that can stabilize the structure. If the inverse folding problem can be solved efficiently, then we can create an oracle algorithm to solve stable state characterization efficiently, where we run inverse folding on our structure, and output whether a sequence was found or not. Thus it follows that there exists no efficient algorithm for inverse folding, unless P = NP.

However, this conclusion is incomplete. In our model, it is possible that a stable conformation for a protein is represented by a small cycle of states rather than by a single stable state. Since the size of the cycle roughly corresponds to the stability of the conformation, an analog of stable state characterization in this case is to find a sequence for which the cycle the state is a part of is minimized. This is also NP-hard, since if the optimal cycle contains only 1 state, a correct optimization algorithm will find a sequence that allows this, and will in turn solve the stable state characterization problem.

In our reduction to the stable state characterization problem, we showed that for a specific model with fixed parameters, the inverse folding problem is NP-complete. To our knowledge, this is the first time evidence has been presented which shows that, in off-lattice models, the difficulty of the inverse folding problem scales with the length of the protein, as opposed to the size of the parameters of the model.

3.5 Complexity of Stable State Characterization with 2 Amino Acids

Our proof above shows that stable state characterization becomes NP-complete at 3 amino acids. We will now show that for any transition function that only recognizes 2 amino acids, the problem lies in P.
In particular, this shows that for the HP-X8 models, stable state characterization can be done efficiently, and therefore the inverse folding problem can also be solved efficiently.

Theorem 3. Stable state characterization with 2 unique amino acids is in P.

Proof. For simplicity, we will show this by reducing the problem to 2-SAT, which can be solved in polynomial time, instead of providing an actual algorithm. Let us call the 2 amino acids T and F. We will also assign each amino acid a boolean value, such that T is true and F is false. Let S be a sequence consisting of T and F, and let A be the adjacency matrix representing the state we want to stabilize. Each cell, based on its surroundings, can be constrained in a way that can be expressed as a 2-CNF, where a and b are the interacting amino acids.

  Constraint on the cell                                               2-CNF
  The cell is stable only if the interaction is between T and T       (a ∨ a) ∧ (b ∨ b)
  The cell is stable only if the interaction is between F and F       (¬a ∨ ¬a) ∧ (¬b ∨ ¬b)
  The cell is stable only if the interaction is between T and F       (a ∨ b) ∧ (¬a ∨ ¬b)
  The cell is unstable only if the interaction is between T and T     (¬a ∨ ¬b)
  The cell is unstable only if the interaction is between F and F     (a ∨ b)
  The cell is unstable only if the interaction is between T and F     (a ∨ ¬b) ∧ (¬a ∨ b)
  The cell is always stable                                            (a ∨ ¬a)
  The cell is always unstable                                          (a ∨ a) ∧ (¬a ∨ ¬a)

We join all of these with conjunctions to produce the 2-CNF, which can be satisfied if and only if there exists a sequence of T and F which stabilizes the state. This reduction produces a 2-CNF that is no larger than a constant factor of the size of the adjacency matrix. Thus, this reduction can be done in polynomial time, and since 2-SAT can be solved in polynomial time, this problem can be solved in polynomial time. Therefore, stable state characterization with 2 unique amino acids is in P.
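Although the proof above deliberately avoids giving an algorithm, the reduction can be sketched in Python as follows. This is a hedged illustration: all names are hypothetical, the interaction limit is assumed to be set high enough never to matter, and the 2-SAT solver is a standard strongly connected components (Kosaraju) routine rather than anything specified in the text. Each cell contributes one clause per forbidden pair of amino acids, which yields formulas logically equivalent to those in the table.

```python
from itertools import product


def neighbourhood(A, i, j):
    """3 x 3 sub-matrix of A centred on (i, j); out-of-range cells count as 0."""
    n = len(A)
    return tuple(
        tuple(A[r][c] if 0 <= r < n and 0 <= c < n else 0
              for c in (j - 1, j, j + 1))
        for r in (i - 1, i, i + 1)
    )


def solve_2sat(n_vars, clauses):
    """Clause ((v1, val1), (v2, val2)) means x_v1 == val1 OR x_v2 == val2.

    Returns a satisfying list of booleans, or None if unsatisfiable.
    """
    def lit(v, val):
        return 2 * v + (0 if val else 1)  # even: x_v, odd: NOT x_v

    size = 2 * n_vars
    graph = [[] for _ in range(size)]
    rgraph = [[] for _ in range(size)]
    for (v1, val1), (v2, val2) in clauses:
        p, q = lit(v1, val1), lit(v2, val2)
        for src, dst in ((p ^ 1, q), (q ^ 1, p)):  # NOT p -> q, NOT q -> p
            graph[src].append(dst)
            rgraph[dst].append(src)

    # Kosaraju, pass 1: record vertices in order of DFS completion.
    order, seen = [], [False] * size
    for start in range(size):
        if seen[start]:
            continue
        seen[start] = True
        stack = [(start, 0)]
        while stack:
            node, k = stack.pop()
            if k < len(graph[node]):
                stack.append((node, k + 1))
                nxt = graph[node][k]
                if not seen[nxt]:
                    seen[nxt] = True
                    stack.append((nxt, 0))
            else:
                order.append(node)

    # Pass 2: components of the reversed graph, numbered in topological order.
    comp, count = [-1] * size, 0
    for start in reversed(order):
        if comp[start] != -1:
            continue
        comp[start] = count
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in rgraph[node]:
                if comp[nxt] == -1:
                    comp[nxt] = count
                    stack.append(nxt)
        count += 1

    if any(comp[2 * v] == comp[2 * v + 1] for v in range(n_vars)):
        return None
    return [comp[2 * v] > comp[2 * v + 1] for v in range(n_vars)]


def stabilizing_sequence(A, transition):
    """Return a T/F sequence that makes A a stable state, or None."""
    n = len(A)
    clauses = []
    for i in range(n):
        for j in range(i + 2, n):  # permanent cells need no constraint
            U = neighbourhood(A, i, j)
            for a, b in product((True, False), repeat=2):
                if transition(U, a, b) != A[i][j]:
                    clauses.append(((i, not a), (j, not b)))  # forbid (a, b)
    assignment = solve_2sat(n, clauses)
    if assignment is None:
        return None
    return "".join("T" if x else "F" for x in assignment)


def contact_map(n, contacts):
    A = [[1 if abs(i - j) <= 1 else 0 for j in range(n)] for i in range(n)]
    for i, j in contacts:
        A[i][j] = A[j][i] = 1
    return A


# Hypothetical toy rule: a contact exists exactly between unlike residues.
def unlike(U, a, b):
    return 1 if a != b else 0


found = stabilizing_sequence(contact_map(4, [(0, 3)]), unlike)
none = stabilizing_sequence(contact_map(5, [(0, 2), (2, 4), (0, 4)]), unlike)
```

In the first example the constraints force positions 0 and 2 to agree, positions 1 and 3 to agree, and positions 0 and 3 to differ; in the second, three mutually contacting residues would have to be pairwise different, which two amino acids cannot achieve.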
□

3.6 Expressivity of Transition Functions

It is easy to solve stable state characterization if we restrict the number of amino acids we use to 2. So is there any reason to use more? The following result shows that systems with more amino acids are more expressive in terms of the stable states that they can allow, implying that there is value in having a system that recognizes more amino acids.

Theorem 4. Suppose we are given an adjacency matrix which requires n unique amino acids to ensure its stability. Then, supposing that no interaction of an amino acid with itself is completely stable, we can construct an adjacency matrix which requires k amino acids to stabilize, where k > n.

Proof. We construct a new adjacency matrix, starting off by taking the original adjacency matrix and adding an extra row to the top and bottom, and an extra column to the left and right. We then take this matrix and copy it diagonally n times. If this results in an adjacency matrix which requires n + 1 unique amino acids to ensure stability, we are done. Otherwise, we require at least n unique amino acids to ensure stability of this new adjacency matrix, since each submatrix that resembles the original matrix requires at least n unique amino acids to ensure stability. Furthermore, in any stabilizing sequence, it is possible to pick a set of n amino acids such that they are all distinct and there are at least 2 amino acids between any two selected, since each submatrix requires all n amino acids to stabilize.

Let S be the set of (sequence × transition function) tuples that ensure stability of the matrix with only n unique amino acids in the sequence. Assume that each transition function makes use of at most n amino acids. This is inconsequential, since if only n unique amino acids are used, then any transition function using a larger set of amino acids can be truncated to only those used. Therefore, S is clearly finite.
For each element s in S, add 3 rows to the bottom of the adjacency matrix. Then pick out n unique amino acids in the sequence such that no 2 of them are spaced closer than 2 amino acids apart. For each such amino acid, find the cell at which it interacts with the amino acid on the 2nd row that was added, then arrange that cell such that if the new amino acid had the same identity, the cell would be unstable according to the transition function. This is always possible by our assumption. Once we have finished doing this, it ensures that for that particular sequence and transition function, there exists no amino acid that stabilizes that contact matrix. If we do this for all (sequence × transition function) tuples, then there exists no transition function under which some sequence can stabilize the adjacency matrix with only n unique amino acids. Thus more than n unique amino acids are required to stabilize this contact matrix. The new number is clearly finite, since if we have a sequence where each element is unique, then trivially there is a transition function which stabilizes the contact matrix. □

Here, we place the restriction that no interaction of an amino acid with itself is completely stable because, if that were the case, every state could be made stable by taking a sequence consisting of just that amino acid. Since such a system is not very expressive, we exclude it from our consideration. Also, we note that since any matrix needs at least 1 amino acid to be stable, the induction can begin.

This result has some limitations. We have not proven that each added amino acid improves expressivity; we have only shown that if we keep adding amino acids, expressivity will eventually improve. We have also not considered the role of interaction limits, but we can always set them arbitrarily high so that the theorem holds.

4 Conclusion

We have presented in this work a system which lays the groundwork for an energy-independent paradigm of protein folding.
As we have demonstrated, even with relatively simple parametrization our system is capable of producing surprisingly reasonable predictions. An analysis of the specific parameters that produce good models suggests that on some level, processes occurring in reality are being replicated by our system. We have also provided some preliminary analysis of the general properties of models of this type. We have shown that inverse folding, under this paradigm, is intractable, and thus may require simplifications and approximations. We also showed that certain kinds of simplifications, while they make problems more tractable, also reduce the expressivity of the models.

Nonetheless, many questions are still open. In particular, we did not investigate stable cycles of states, which intuitively should be more frequent than stable states as the end product of a folding process. Although we have established a connection between stable states and stable cycles through the problem of minimizing the length of a stable cycle, a thorough investigation has yet to be done. We believe that the work presented here provides a strong basis for future work in theoretical protein folding. We hope that future work will further elucidate the features of this paradigm, and use it to create novel and interesting models of protein folding.

References

[Berger and Leighton, 1998] Berger, B. and Leighton, T. (1998). Protein folding in the hydrophobic-hydrophilic (HP) model is NP-complete. J Comput Biol, 5(1):27–40.
[Berman et al., 2007] Berman, P., DasGupta, B., Mubayi, D., Sloan, R. H., Turán, G., and Zhang, Y. (2007). The inverse protein folding problem on 2D and 3D lattices. Discrete Applied Mathematics, 155(6-7):719–732.
[Bryngelson et al., 1995] Bryngelson, J. D., Onuchic, J. N., Socci, N. D., and Wolynes, P. G. (1995). Funnels, pathways, and the energy landscape of protein folding: a synthesis. Proteins, 21(3):167–95.
[Cheng and Baldi, 2005] Cheng, J. and Baldi, P. (2005).
Three-stage prediction of protein β-sheets by neural networks, alignments and graph algorithms. Bioinformatics, 21(suppl 1):i75–i84.
[Crescenzi et al., 1998] Crescenzi, P., Goldman, D., Papadimitriou, C., Piccolboni, A., and Yannakakis, M. (1998). On the complexity of protein folding. J Comput Biol, 5(3):423–65.
[Dill, 1985] Dill, K. A. (1985). Theory for the folding and stability of globular proteins. Biochemistry, 24(6):1501–9.
[Dill et al., 1993] Dill, K. A., Fiebig, K. M., and Chan, H. S. (1993). Cooperativity in protein-folding kinetics. Proc Natl Acad Sci U S A, 90(5):1942–6.
[Guyeux et al., 2014] Guyeux, C., Côté, N. M.-L., Bahi, J. M., and Bienia, W. (2014). Is protein folding problem really a NP-complete one? First investigations. J Bioinform Comput Biol, 12(1):1350017.
[Head-Gordon and Brown, 2003] Head-Gordon, T. and Brown, S. (2003). Minimalist models for protein folding and design. Curr Opin Struct Biol, 13(2):160–7.
[Hockenmaier et al., 2007] Hockenmaier, J., Joshi, A. K., and Dill, K. A. (2007). Routes are trees: the parsing perspective on protein folding. Proteins, 66(1):1–15.
[Istrail and Lam, 2009] Istrail, S. and Lam, F. (2009). Combinatorial algorithms for protein folding in lattice models: A survey of mathematical results. Communications in Information and Systems, 9(4):303–346.
[Karplus and McCammon, 2002] Karplus, M. and McCammon, J. A. (2002). Molecular dynamics simulations of biomolecules. Nat Struct Biol, 9(9):646–52.
[Khodabakhshi et al., 2009a] Khodabakhshi, A. H., Manuch, J., Rafiey, A., and Gupta, A. (2009a). Inverse protein folding in 3D hexagonal prism lattice under HPC model. J Comput Biol, 16(6):769–802.
[Khodabakhshi et al., 2009b] Khodabakhshi, A. H., Manuch, J., Rafiey, A., and Gupta, A. (2009b). Stable structure-approximating inverse protein folding in 2D hydrophobic-polar-cysteine (HPC) model. J Comput Biol, 16(1):19–30.
[Kolinski and Skolnick, 2004] Kolinski, A. and Skolnick, J. (2004).
Reduced models of proteins and their applications. Polymer, 45(2):511–524. Conformational Protein Conformations.
[Krasnogor et al., 2005] Krasnogor, N., Terrazas, G., Pelta, D. A., and Ochoa, G. (2005). A critical view of the evolutionary design of self-assembling systems. In Artificial Evolution, 7th International Conference, Evolution Artificielle, EA 2005, Lille, France, October 26-28, 2005, Revised Selected Papers, pages 179–188.
[Lane et al., 2013] Lane, T. J., Shukla, D., Beauchamp, K. A., and Pande, V. S. (2013). To milliseconds and beyond: challenges in the simulation of protein folding. Curr Opin Struct Biol, 23(1):58–65.
[Pelta and Carrascal, 2007] Pelta, D. A. and Carrascal, A. (2007). Inverse protein folding on 2D off-lattice model: Initial results and perspectives. In Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics, 5th European Conference, EvoBIO 2007, Valencia, Spain, April 11-13, 2007, Proceedings, pages 207–216.
[Reyes et al., 2014] Reyes, J. S., Villot, P., and Diéguez, M. (2014). Emergent protein folding modeled with evolved neural cellular automata using the 3D HP model. Journal of Computational Biology, 21(11):823–845.
[Vendruscolo et al., 1999] Vendruscolo, M., Najmanovich, R., and Domany, E. (1999). Protein folding in contact map space. Phys. Rev. Lett., 82:656–659.
[Yue and Dill, 1992] Yue, K. and Dill, K. A. (1992). Inverse protein folding problem: designing polymer sequences. Proc Natl Acad Sci U S A, 89(9):4163–7.
[Zimm and Bragg, 1959] Zimm, B. H. and Bragg, J. K. (1959). Theory of the phase transition between helix and random coil in polypeptide chains. The Journal of Chemical Physics, 31(2):526–535.