Computer Science Technical Reports
Technical Report No. 1997/05

StructWeb: Biosequence structure searching on the Web using clp(FD)

David Gilbert, Ingvar Eidhammer and Inge Jonassen
May 1997
ISSN 1364-4009

City University, Dept. of Computer Science, Northampton Square, London EC1V 0HB, United Kingdom

Ingvar Eidhammer
Department of Informatics, University of Bergen, HIB, N-5020 Bergen, Norway
[email protected]

Inge Jonassen
Department of Informatics, University of Bergen, HIB, N-5020 Bergen, Norway
[email protected]

David Gilbert
Department of Computer Science, Northampton Square, London EC1V 0HB, United Kingdom, email: [email protected]

Abstract

We describe an implementation, in a finite domain constraint logic programming language, of a web-based biosequence structure searching program. We have used the clp(FD) language for the implementation of our search engine and have ported the PiLLoW libraries to clp(FD). Our program is based on CBSDL, a constraint based structure description language for biosequences, and uses constrained descriptions to search for the secondary structure of biosequences, such as tandem repeats, stem loops, palindromes and pseudo-knots.

Keywords: constraints, biostructures, description language, searching, WWW interface.

1 Introduction

In this paper we report an implementation of a web-based biosequence structure searching program. This implementation has been constructed using the finite domain constraint logic programming language clp(FD) [9] together with our own port of the PiLLoW library [7] to clp(FD). Our search engine is based on CBSDL, a constraint based structure description language for biosequences described in [11], and uses constrained descriptions to search for the secondary structure of biosequences, such as tandem repeats, stem loops, palindromes and pseudo-knots.
In this paper we describe the background to our structure searching language, the implementation of the search engine, issues encountered in porting the PiLLoW libraries to clp(FD) and the Web interface that we have constructed for our search engine.

2 Biosequences and structures

2.1 Biological motivation

Biological macromolecules (DNA, RNA and proteins) are chains of relatively small organic molecules. There are few different types of these organic molecules: 4 different bases for DNA and RNA, and 20 different amino acids for proteins. Conventionally, DNA/RNA alphabets are written in lower case and protein alphabets in upper case. A macromolecule can be coded as a string over an alphabet of size 4 (for DNA/RNA) or 20 (for proteins), starting from one end of the chain and moving towards the other. The strings for DNA/RNA molecules are called nucleotide sequences, and each element in such a sequence is called a base. The strings for protein molecules are called protein sequences, and each element in such a sequence is an amino acid (residue). Collectively, nucleotide and protein sequences are called biosequences, or just sequences. Sometimes we will also refer to them simply as strings.

Watson and Crick discovered in 1953 that DNA forms a double helix in which a base in one strand (chain) is bonded to a complementary base in the other strand; the so-called Watson-Crick base pairs are a-t and g-c. The bases in RNA molecules can form bonds a-u and g-c in a similar way.

RNA and protein molecules fold into three-dimensional structures enabling them to perform their structural/functional role in the cell. The structures can be described at different levels. For RNA molecules, the secondary structure is the collection of base pairs which are formed in the folded molecule, and the tertiary structure is the complete three-dimensional structure of the folded molecule. [...] as for RNA molecules.
An important problem in molecular biology is the prediction of the biological properties of a macromolecule from its sequence, in particular the prediction of the structure and function of an RNA molecule or a protein from its sequence. Proteins may be grouped into families where the members of a family have similar structures. If the structure of one family member is known, this helps in finding the structure of the other proteins in the family. Features that are common to the sequences of the proteins in a family can be expressed in a pattern, and a new sequence can be hypothesised to belong to the family if it fits the pattern.

Most languages used to define patterns for protein sequences permit only the definition of what we call sequential patterns. Sequential patterns give sufficient expressive power to describe sequence features that are characteristic for many protein families. This is illustrated by the PROSITE protein family database, which gives descriptive patterns for most of its families [3].

For describing patterns in RNA sequences, one needs to include dependencies between individual letters, because the base-pairing interactions (most importantly a-u, g-c and g-u) play a dominant role in determining RNA structure and function [17]. Figures 1 (i) and (ii) show two stem-loops (defined later) that might be structurally and functionally equivalent in RNA molecules. It does not matter which bases (symbols) are present in the sequence in order for a stem-loop to be formed, as long as the sequence contains two substrings of some minimum and identical length which are reverse complements of each other. We will call such patterns of dependencies structures in the sequences, and note that such patterns can be described using structural patterns as defined in Section 2.3 below. We can also describe other structures found in RNA and DNA molecules, such as clover-leafs and pseudo-knots.
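The observation that only the pairing relationship matters, not the particular bases, can be illustrated with a minimal Python sketch. The function name `can_form_stem` and the table `PAIRS` are hypothetical; the pairing table simply encodes the a-u, g-c and g-u interactions mentioned above, and the sketch is not part of the search engine described in this report.

```python
# Base pairs mentioned in the text: a-u, g-c and g-u (in both orientations).
PAIRS = {("a", "u"), ("u", "a"), ("g", "c"), ("c", "g"), ("g", "u"), ("u", "g")}

def can_form_stem(left: str, right: str) -> bool:
    """True if the two arms have equal length and left[i] can pair with
    the i-th symbol of right read backwards."""
    if len(left) != len(right) or not left:
        return False
    return all((a, b) in PAIRS for a, b in zip(left, reversed(right)))

# Two different base sequences that satisfy the same structural requirement.
print(can_form_stem("acg", "cgu"))  # True
print(can_form_stem("ggc", "gcc"))  # True
print(can_form_stem("acg", "acg"))  # False: a cannot pair with g
```

Both accepted examples have different bases but the same dependency pattern, which is what a structural pattern is meant to capture.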
Finding a match to a structural pattern in an RNA sequence does not imply that the corresponding molecule in its native folded state will have the base pairing described by the pattern. It is believed that the native structure will be one with minimum free energy, and a set of base pairings other than the one described by the pattern might give a lower free energy.

[Figure 1: four stem-loop diagrams, panels (i) to (iv)]

Figure 1: Illustration of structures and structural patterns: (i) and (ii) show two examples of structures (stem loops) that might be equivalent in RNA molecules. Watson-Crick base pairing is between a and u, and between g and c. Other base pairings are also possible. Figure (iii) shows a possible representation of a pure structural pattern matching the structures (i) and (ii). The pattern can also be called a consensus for the structures in (i) and (ii). The o and x symbols each match any one nucleotide symbol, and pairs of o symbols which are connected with a dash (-) should match pairs of symbols that can base pair. The x symbols are wildcards: one x matches any one symbol. Figure (iv) shows a structural pattern equivalent to the pattern shown in (iii), except that the first nucleotide in the first part of the stem has to be a c (a restriction on the content of the substrings to match the pattern).

Structural patterns should not be used alone to predict the secondary structure of RNA, but can be used in conjunction with structure prediction methods to provide hypotheses of possible folds. This can be done efficiently because matching a string against a structural pattern is computationally cheap compared to structure prediction. Another advantage of using structural patterns is that they can be used to describe complex structures which are not allowed when using dynamic programming based structure prediction.
We postulate that algorithms can be developed for finding conserved structural patterns in a set of RNA sequences analogous to algorithms for finding conserved sequential patterns in sets of protein sequences [5], and will investigate this in further work. In this way structural patterns allow for description of potentially interesting conserved structures in sets of related biosequences.

Structures are also found in DNA sequences that can be described using structural patterns but not using sequential patterns. This includes structures such as repeats and palindromes. Repeats are abundant in genomic DNA, both in coding and in non-coding areas, and for instance recognition sites for restriction enzymes are often palindromes.

2.2 Example structures

In the structure descriptions below, α and β (with or without indices) are pattern components and x is a wildcard (matching any one letter in an input string); α^r is the reverse of α, and α^c is the complement of α. α^rc is the reverse complement of α. We have identified the following structures in the literature, see for example [18, 4]. For each type we give one example. All examples are from DNA/RNA sequences, except for the last, which is from a protein sequence.

    Structure            Pattern                                 Example
    Tandem repeat        α α                                     acgacg
    Simple repeat        α β α                                   acgaaacg
    Multiple repeat      α β α β_1 α                             acgaaacguuacg
    Stem loop            α β α^rc                                acgaacgu
    Attenuator           α β α^rc β_1 α                          acgaacguauacg
    Palindrome, even     α α^r                                   acggca
    Palindrome, odd      α x α^r                                 acgagca
    Pseudoknot           α_1 β α_2 β_1 α_1^rc β_2 α_2^rc         acgaaucugccguauaaga
    Sense - antisense    (uses α^c)                              IVLSPANHK

More complicated structures can be obtained by combining the ones above, e.g. clover-leafs.

2.3 Constrained patterns in biosequences

The goal of our research underpinning the development of our search engine is to investigate how constraint solving techniques can be used to search for structural patterns in sequences (or strings) of symbols over a finite alphabet Σ.
The main motivation is searching in biological sequences, and also providing high-level descriptions of biosequence database contents, but we believe that programs for searching for such patterns might also be useful in other areas, e.g. signal processing or the processing of acoustic data.

We define a pattern as consisting of a logical expression on components and a set of unary and binary constraints on the components, where a component is a description of a string of symbols. An input string S matches a pattern if for each component it contains a substring matching that component, such that all the constraints are satisfied. A pattern can contain constraints of the following five types:

1. length of a substring to match a specific component,
2. distance (in the input string) between substrings to match the different components of a pattern,
3. contents of a substring to match a component, e.g. the second symbol should be an a or a t,
4. positions on the input string where a particular component can match,
5. correlation between two substrings matching different components, e.g. the substrings should be identical, or the reverse of each other.

We also define three associated classes of patterns:

Sequential: patterns which do not include a correlation constraint. The patterns in the PROSITE database [3] are examples of this class, for example [AC]-x(2,3)-D, describing a pattern comprising three components, the first being an A or a C, the second of length 2 or 3 and the last consisting of a D.

Pure structural: patterns including at least one correlation constraint and no content constraints. One example is repetition, where the substrings matching two different components must be identical. Another example is a palindrome, where two consecutive substrings of equal length must be the reverse of each other.

Structural: patterns having at least one correlation constraint and one content constraint. One example is a palindrome beginning with an a.
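Because sequential patterns contain no correlation constraints, they can be matched with ordinary regular expressions. The following Python sketch translates a small subset of PROSITE syntax (single residues, bracketed sets, the x wildcard and repeat counts) into a regular expression. The function name `prosite_to_regex` is hypothetical and the translation covers only the constructs used in the example above, not the full PROSITE syntax.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a small subset of PROSITE syntax into a Python regex.

    Supported elements: a residue (A), a set ([AC]), the wildcard x,
    each optionally followed by a count (n) or a range (n,m).
    """
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        symbol, low, high = m.groups()
        atom = "." if symbol == "x" else symbol
        if low is not None:
            atom += "{" + low + ("," + high if high else "") + "}"
        parts.append(atom)
    return "".join(parts)

regex = prosite_to_regex("[AC]-x(2,3)-D")
print(regex)                              # [AC].{2,3}D
print(bool(re.search(regex, "MKAGHDL")))  # True: A, then GH, then D
```

A correlation such as "these two substrings must be reverse complements" has no counterpart in this translation, which is the limitation that motivates structural patterns.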
3 The structure language

A pattern in our language is defined by a structure specification, which is of the form S, c_1, ..., c_n where S is a string expression and c_1, ..., c_n is a set of constraints. The string expression specifies the components, or string variables (denoted by the Greek letters α, β, γ, ... and possibly subscripted), taking part in the pattern, and a logical expression on them using conjunction, disjunction and negation. We follow the convention that [...]

A set of constraints can contain constraints of the five types: length, distance, content, position and correlation constraints, which are described below. In addition, we permit equality and inequality operations over the integer components of the constraints, with the usual arithmetic operations over integers: addition, subtraction, multiplication and integer division. We further allow the user to describe complex structures by conjoining structure descriptions.

A length constraint restricts the length of a string variable to be within a particular range, and has the form length(α, L) where α is a string variable and L ranges over the positive integers, such that the length of α is constrained to be within the range of L. We permit the length of a string variable to be 0 in order to be able to describe null-strings. Furthermore, we introduce two variants, maxlength(α, L) and minlength(α, L), such that the length of α is the maximum, respectively minimum, value possible within the range denoted by L according to some mapping to a given input string. Redundant matches are thereby avoided in the case of e.g. stem loops, where substrings of the stem are not required.

Distance constraints restrict the distance between two string variables, and are specified in a declarative and uniform way, e.g. start_start(α, β, D), end_start(α, β, D), start_end(α, β, D), end_end(α, β, D) where α and β are string variables and D ranges over the integers.
These relations constrain the distance between the start of α and start of β (respectively end of α and start of β, start of α and end of β, end of α and end of β) to lie within the range denoted by D. A negative value for D indicates that the point of reference of α occurs after the corresponding point of reference of β in the input string. We also permit the shorthand α:β to indicate that β starts directly after α. This shorthand is equivalent to α ∧ β, end_start(α, β, 1).

A content constraint restricts which symbols can be in a specific position on a string variable matching a component and is expressed thus: content(α, Pos, Set) where α is a string variable, Pos is a positive or negative (non-zero) integer representing α_Pos, the character from α at position Pos from the start (or end if Pos is negative) of α, and Set is a (non-empty) set of characters to which α_Pos may be bound, e.g. {a,t}.

A position constraint restricts the absolute positions of a string variable on the input string and is expressed as start(α, P) or end(α, P) where α is a string variable and P ranges over the positive integers, such that the first (respectively last) character of α is located at position P on the input string.

A correlation constraint ("correlation" for short) defines the relation between the contents of two string variables. A correlation C has the following properties:

- It relates two string variables C(α, β), the string variable α being called the source, and β the target.
- The length of the two string variables must be equal (due to equal numbers of symbols in the matching substrings), implying that there is an implicit length constraint between the two strings.
- There is a direction-component C_d, written as the relation C_d(α, β). The two legal values for C_d are 1 and -1. $1(\alpha, \beta)$ is satisfied iff $(\forall i : 1 \le i \le h : \alpha_i \text{ is related to } \beta_i)$. $-1(\alpha, \beta)$ is satisfied iff $(\forall i : 1 \le i \le h : \alpha_i \text{ is related to } \beta_{h-i+1})$, where $\alpha_i$ and $\beta_i$ are symbols from α and β, and h is the length of the matching substrings.
Note that this means that all positions of the string variables take part in the correlation.

- There is a symbol-component C_s. As part of this component a function C_f is defined from $\Sigma$ to $2^{\Sigma}$. $C_s(\alpha, \beta)$ is satisfied iff $(\forall i : 1 \le i \le h : \beta_i \in C_f(\alpha_i))$.

Let L be the language of all strings with symbols from Σ. The correlation C(α, β) is satisfied iff $\exists x : x \in L : C_d(\alpha, x) \wedge C_s(x, \beta)$.

Furthermore, we define a notion of approximate matching, given as an argument to the appropriate correlation constraints. This argument ranges over the interval 0..100 and represents the percentage mismatch between two string variables; when the mismatch is zero we can omit this argument. We can use Hamming distance [13], edit distance or more generally Levenshtein distance [15] in order to implement approximate matching (see footnote 1).

We define id(α, β) and reverse(α, β) as general correlation constraints over all alphabets, where β is the identity (respectively, reverse) of α, and assume that there is a library of correlations, and that a user may add a new correlation to the library, use a known correlation, or use (without storing it in the library) an unnamed correlation in a specification. A correlation is thus defined by two arguments, the direction and the symbol component. For example, the definition of the reverse complement for the DNA alphabet is rev_compl_DNA(-1, {A → {T}, C → {G}, G → {C}, T → {A}}). Pre-defined correlations might be in a library.

Example structure descriptions

A description, or structure specification, of the stem loop using exact matching in Figure 1(iv) is

    α:β:γ, maxlength(α, 4), length(β, 1), content(α, 1, {c}), rev_compl_RNA(α, γ)

assuming a library definition of rev_compl_RNA as above, and where α and γ form the stem, with β the loop.
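The two-component definition of a correlation given above can be made concrete with a short Python sketch. It assumes that "is related to" in the direction component means equality, so that the intermediate string x is the source read forwards (direction 1) or backwards (direction -1); this reading, the function name `correlated` and the constant `IDENTITY_DNA` are assumptions made for illustration. The table in `REV_COMPL_DNA` is the one given in the text.

```python
# A correlation is a pair (direction, Cf), where Cf maps each source symbol
# to the set of target symbols it may correspond to.
REV_COMPL_DNA = (-1, {"A": {"T"}, "C": {"G"}, "G": {"C"}, "T": {"A"}})
IDENTITY_DNA = (1, {s: {s} for s in "ACGT"})  # assumed encoding of id

def correlated(source: str, target: str, correlation) -> bool:
    """Check C(source, target) for an exact (zero mismatch) correlation."""
    direction, cf = correlation
    if len(source) != len(target):
        return False  # implicit length constraint
    # Direction component: x is the source read forwards or backwards.
    x = source if direction == 1 else source[::-1]
    # Symbol component: every target symbol must lie in Cf of the x symbol.
    return all(t in cf.get(s, set()) for s, t in zip(x, target))

print(correlated("CCTG", "CAGG", REV_COMPL_DNA))  # True
print(correlated("ACG", "ACG", IDENTITY_DNA))     # True
print(correlated("ACG", "ACG", REV_COMPL_DNA))    # False
```

Keeping the direction and the symbol table as data, rather than writing one function per correlation, mirrors the idea of a library to which new correlations can be added.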
A longer version without using the shorthand α:β:γ would be

    α ∧ β ∧ γ, maxlength(α, 4), length(β, 1), end_start(α, β, 1), end_start(β, γ, 1), content(α, 1, {c}), rev_compl_RNA(α, γ, 0)

User queries can be formulated in this "raw form", where an input string is appended to a structure description and some mapping algorithm is used to map the description to the string. Thus the user may enter the following query:

    α:β:γ; maxlength(α, 4), length(β, 1), content(α, 1, {c}), rev_compl_DNA(α, γ), tatacctgtcaggtata

which will result in α being mapped to the substring cctg starting at position 5 and ending at 8, γ to cagg starting at 10 and ending at 13, and β to t at position 9. Queries may be optionally prefaced by a description of the alphabet of characters which are permitted in the input string.

Footnote 1: Minimum transformation costs are calculated for: Hamming distance, substitution only; edit distance, insertion and deletion only; Levenshtein distance, substitution, deletion and insertion.

In order to improve the usability of the language we have defined a macro facility permitting the user to store and re-use definitions of, i.e. grammars for, specific structures. The syntax of this macro language is similar to that of logic programs; for example the following grammars define languages for stem loops and pseudo knots:

    stemloop(α, β, γ) :- α:β:γ, rev_compl_RNA(α, γ).
    pseudoknot(α, β, γ, δ) :- α:ω1:β:ω2:γ:ω3:δ, rev_compl_RNA(α, γ), rev_compl_RNA(β, δ).

Such descriptions can be parameterised by the lengths of the components of the structure, for example

    stemloop(α, β, γ, StemLength, LoopLength, Mismatch) :- α:β:γ, maxlength(α, StemLength), length(β, LoopLength), maxlength(γ, StemLength), rev_compl_RNA(α, γ, Mismatch)

Note that in this description we are interested in finding the stem loops with the maximal possible length of the stem, in order to avoid matching on to many substructures of that stem loop. We are not, however, interested in defining maxlength over the loop, since if we did so we would possibly omit several different structures.
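The effect of applying maxlength to the stem but not to the loop can be sketched in Python as a brute-force search. This is not the constraint-based engine of this report: the function names are hypothetical, matching is exact (zero mismatch), the pairing table uses the RNA complements a-u, c-g, g-c, g-u, and "maximal" is interpreted here, as an assumption, to mean that the stem is grown as far inwards as possible from an outer pair that cannot itself be extended outwards.

```python
# RNA complements: each base maps to the bases it may pair with.
RNA_PAIRS = {"a": "u", "c": "g", "g": "cu", "u": "ag"}

def pairs(x: str, y: str) -> bool:
    return y in RNA_PAIRS.get(x, "")

def stem_loops(seq, stem_min, stem_max, loop_min, loop_max):
    """Yield (start, end, left_arm, loop, right_arm), 1-based inclusive."""
    n = len(seq)
    for i in range(n):
        for j in range(i + 1, n):
            # Skip outer pairs that are merely the inside of a longer stem.
            if i > 0 and j < n - 1 and pairs(seq[i - 1], seq[j + 1]):
                continue
            # Grow the stem inwards while bases pair and the loop stays legal.
            k = 0
            while (k < stem_max
                   and (j - i + 1) - 2 * (k + 1) >= loop_min
                   and pairs(seq[i + k], seq[j - k])):
                k += 1
            loop_len = (j - i + 1) - 2 * k
            if k >= stem_min and loop_len <= loop_max:
                yield (i + 1, j + 1, seq[i:i + k],
                       seq[i + k:j - k + 1], seq[j - k + 1:j + 1])

hits = list(stem_loops("ugcucaaaagagcuaaagagcu", 3, 7, 1, 20))
for hit in hits:
    print(hit)
```

Because the loop length is left free within its range, the same left arm can be reported with several different right arms, whereas shorter stems nested inside a reported stem are suppressed.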
For example, given the sequence

    ugcucaaaagagcuaaagagcu
    1234567890123456789012

and an attempt to match with a definition

    stemloop(α, β, γ, StemLength, LoopLength, 0), 3 ≤ StemLength, StemLength ≤ 7, 1 ≤ LoopLength, LoopLength ≤ 20

two different stem loops should be identified, i.e.

(1) gcucaaaagagc, starting at 2 and ending at 13, where α = gcuc, β = aaaa, γ = gagc
(2) gcucaaaagagcuaaagagc, starting at 2 and ending at 21, where α = gcuc, β = aaaagagcuaaa, γ = gagc

4 Implementation of the search engine using constraint logic programming

4.1 Language representation in clp(FD)

We represent components, which we term here string variables, by sequences with maximum length m of string-characters. These comprise pairs whose first element Chars is a set of characters drawn from some alphabet A (of bases or nucleotides) and whose second element Pos is a set of integers in 1...m. Each pair represents the possible values of the characters to be found on the input string at the locations indicated by the second element of the pair. Moreover we assume that the successor relation holds between the second elements of neighbouring members of the sequence in the normally accepted direction of ordering. A (suitably constrained) string variable is thus a schema for a structure, and can be instantiated by matching against an input string (see below).

We have chosen constraint logic programming over finite domains [14] as a paradigm for implementation because of the declarative nature of our structure language and the use which it makes of finite domain constraints. In our implementation sequences are represented as lists, and thus string variables comprise lists whose elements are pairs of (Chars,Pos). We choose also to map alphabets onto (dense subsets of) natural numbers, so that for example for DNA we represent a, c, g, t by 1, 2, 3 and 4 respectively. In this way we can use any finite domain constraint logic programming language, including one which does not permit operations over arbitrary finite domains.
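The list-of-pairs representation can be mimicked in Python with sets standing in for finite domains. This is only an illustrative sketch: the helper names are hypothetical, Python sets do no constraint propagation, and the numeric codes are the DNA mapping given above.

```python
DNA_CODES = {"a": 1, "c": 2, "g": 3, "t": 4}  # mapping given in the text

def make_stringvar(text, start=1):
    """Fully instantiated string variable: singleton Chars and Pos sets."""
    return [({DNA_CODES[ch]}, {start + offset}) for offset, ch in enumerate(text)]

def fresh_stringvar(length, max_pos):
    """Unconstrained string variable: any character at any position 1..max_pos."""
    all_chars = set(DNA_CODES.values())
    return [(set(all_chars), set(range(1, max_pos + 1))) for _ in range(length)]

def restrict_content(var, pos, allowed):
    """Content constraint: narrow Chars at 1-based position pos
    (a negative pos counts from the end of the string variable)."""
    index = pos - 1 if pos > 0 else pos
    chars, positions = var[index]
    var[index] = (chars & {DNA_CODES[c] for c in allowed}, positions)

print(make_stringvar("agt"))
stem = fresh_stringvar(3, 10)
restrict_content(stem, 1, "at")   # first symbol must be a or t
restrict_content(stem, -1, "c")   # last symbol must be c
print([chars for chars, _ in stem])
```

An input string corresponds to the fully instantiated case, while a pattern component starts from the unconstrained case and is narrowed by each constraint that mentions it.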
We have used clp(FD) [9] as the basis for our implementation because it has a specialised operation for complementation over genomic alphabets (see below). Moreover, the clp(FD) system is freely available, small in size and can compile to executable code. Ideally we would also like to be able to use a string solver, along the lines of [19], [12] or [16].

Length constraints are defined in the usual backtracking manner for lists, although ideally we would like to use a list solver (for example [16]).

Distance constraints are defined simply by referring to the position elements of character pairs.

Content constraints are implemented by imposing constraints on the integer sets representing the characters, using the sparse representation of finite domain variables in clp(FD) to describe non-continuous domains.

Position constraints are straightforwardly implemented by constraining the position element of a string-character pair.

General correlation constraints (those independent of any alphabet) are coded in clp(FD) as follows. The id constraint constrains the corresponding characters in the string-character pairs to be equal. Note that the position elements in each corresponding pair are not constrained by this relation, since the string variables may be mapped to different places on the input string. The reverse constraint first of all reverses one of the string variables and then constrains it to be identical to the other string variable.

Approximate matching between string variables is implemented using Hamming distance and relating this to the length of the list representing the string variable.

Complementation constraints are implemented using a specialised solving routine compl/4 in clp(FD). For example RNA, whose alphabet a, c, g and u we represent by 1, 2, 3 and 4 respectively, has complements {a → {u}, c → {g}, g → {c,u}, u → {a,g}}.
We represent this by

    % One compl/4 constraint per symbol of the RNA alphabet (a=1, c=2, g=3, u=4).
    complement_char(Char1,Char2):-
        compl(Char1,1,Char2,[4]),
        compl(Char1,2,Char2,[3]),
        compl(Char1,3,Char2,[2,4]),
        compl(Char1,4,Char2,[1,3]).

where the definition of compl/4 is

    % If A equals Char then B must take one of the values in Chars.
    compl(A, Char, B, Chars):-
        A=Char <=> Val1,
        B in Chars <=> Val2,
        Val1 in 0 .. max(Val2),
        Val2 in min(Val1) .. 1.

4.2 The search engine

The function of a processor for our language is to match a structure description on to an input string, in order to determine the contents and locations of those substrings of the input string which match the components of the description. Thus a solution to a mapping of a string expression onto an input string is a valuation (an assignment to each constraint variable in the string expression of one value from the domain of the variable) such that all the constraints are satisfied. Each element of all string-character pairs must be a singleton set satisfying the constraints on that element; an empty set indicates a failure to produce a solution. In our problem domain we are interested in producing all the solutions (mappings) possible of a given string expression onto an input string.

An input string I comprises a sequence of characters drawn from some alphabet A (of bases or nucleotides); we limit the length of any string to be less than or equal to some maximum integer m. In order to perform mapping we first convert the input string into a string variable, i.e. a list whose elements are pairs of (Chars,Pos). For example the sequence of bases agt is converted to the list [({1},{1}),({3},{2}),({4},{3})] using our numeric representation of the base alphabet.

We have defined a naive procedure to map a specification Spec (i.e. a constrained string expression SE) onto an input string I using backtracking. We assume two types of correlation, c (normal correlation) and r (reverse correlation), and a function p1 : x × y → x. [...] variables efficiently.

    for each pair of string variables (α, β) in SE correlated by correlation c do
        find members of I s.t. α_1 = I_j and β_1 = I_k and set i = 1
        while c(p1(α_i), p1(β_i)) and i ≤ length(α) do
            i := i + 1 and j := j + 1 and k := k + 1
            α_i = I_j and β_i = I_k
        end
    end
    for each pair of string variables (α, β) in SE correlated by correlation r do
        set l = length(β)
        find members of I s.t. α_1 = I_j and β_l = I_k and set i1 = 1, i2 = l
        while c(p1(α_i1), p1(β_i2)) and i1 ≤ length(α) do
            i1 := i1 + 1 and i2 := i2 - 1 and j := j + 1 and k := k - 1
            α_i1 = I_j and β_i2 = I_k
        end
    end

However, in the algorithm for the general case (including disjunction and negation) we do not do this pairwise mapping:

    proc map(SE)
        if SE = A ∧ B then do proc(A) and proc(B) end
        if SE = A ∨ B then do proc(A) or proc(B) end
        if SE = ¬A then do not proc(A) end
        if SE is a string variable α then do
            find a member of I s.t. α_1 = I_j
            while i ≤ length(α) do
                i := i + 1 and j := j + 1
                if α_i = I_j then true else fail end
            end
        end
    end

Since our program is compiled to native code without an emulator or top-level query evaluator in clp(FD), we generate all the possible matches between a string variable and an input string by a failure-driven loop. This avoids the need to write our own query evaluator with interactive backtracking on user input of `;'. Moreover, since our searches are computationally expensive we do not use setof/3 in order to collect solutions, and prefer to give the user the ability to abort the computation if too many solutions are being produced, or too much time is being taken by the computation.

5 Interfacing the search engine to the WWW

5.1 General approach

We have implemented a search engine based on our language using clp(FD), and have also produced a simple glass-teletype front-end which permits users to specify constraints on stem loops in an interactive fashion.
The program is based on the style of the definition of stemloop/6 in Section 3:

    stemloopRNA:-
        stringvar(α), stringvar(β), stringvar(γ),
        start_end(α, β, 1), start_end(β, γ, 1),
        write('Length of stem [Min,Max] '), read([MinS,MaxS]),
        write('Length of loop [Min,Max] '), read([MinL,MaxL]),
        write('Mismatch [Min,Max] '), read([MinM,MaxM]),
        StemLength #>= MinS, StemLength #<= MaxS,
        LoopLength #>= MinL, LoopLength #<= MaxL,
        Mismatch #>= MinM, Mismatch #<= MaxM,
        maxlength(α, StemLength), length(β, LoopLength), maxlength(γ, StemLength),
        rev_compl_RNA(α, γ, Mismatch),
        write('Input string: '), read(InString),
        make_stringvar(InString, InStringVar),
        append_all([α, β, γ], SearchString),
        match(SearchString, InString),
        output([α, β, γ]).

Due to our implementation in a logic programming language, we permit the user to enter variables for requested values (except for the input string!) and let the logic computation attempt to generate those values. Obviously, the more the string variable data structures are constrained by the input, the more efficient is the computation.

5.2 Why a Web interface?

A Web interface proved an attractive proposition due to several factors: the ease with which a user-friendly interface could be provided, the freedom from multiple architecture considerations, and the fact that system updates can be made available instantaneously.

User interface. The system interface as sketched above is not attractive to users, since they usually want to make repeated queries with small changes in parameters; some kind of form-like interface, where user-entered data values are preserved between queries, would be ideal for this. The Web provides a very easy way to construct such an interface.

Architectures. The non-Web version has to be recompiled for the different architectures/operating systems on which it is to be run (and there will always be one user who has a machine to which the clp(FD) compiler has not been ported). The Web version allows us to compile the program for one architecture only: our server.
Testing. At present the program is in a Beta-test stage; we need to get feedback quickly from potential users and would like to do this without having to physically install it on their systems.

Updates. As the program is being updated rapidly, we would like to make these updates immediately available to users; the Web is the ideal mechanism to achieve this.

Hence, in order to make our program more accessible and to make the latest version of the search engine available, we have constructed a user interface accessible via a Web browser capable of handling HTML forms. We have given a default query data set and input string in the form so that the naive user has an example query to experiment with. Interfacing to the search engine is achieved using the PiLLoW libraries: the interface permits the user to enter descriptions of the structures of interest and to initiate a mapping operation, and then returns the results of the mapping to the user. The queries are handled by a query evaluator, which checks query parameters, expands macros, and translates the queries into an internal form. This form is passed down to the constraint search engine, which sets up the data structures, imposes the constraints on them and uses a matching algorithm to solve the constraints. Results of matching are output as the locations of the strings found, and optionally the strings themselves; we plan to enhance the system with some graphical representation of the structures found.

5.3 Forms and CGI interface

The general issues of making applications accessible using the Web are covered in various texts, see for example [6]; the PiLLoW system is described in [7], which contains a detailed description of methods for interfacing logic programming systems to the Internet/WWW.
Briefly, an Internet client can invoke a program on a server via a browser, for example, by sending the URL of the program to the server (as long as the program has a recognised extension, .cgi, and the right permissions are set). Such a program is called a CGI (Common Gateway Interface) program. Output from the invocation is returned to the client and must be in the form of an HTML page if it is to be interpreted by a browser. The main challenge is that of permitting data from the client to be sent to the server and then accepted as input by the program on the server.

Sending data from the client to the server-side program can be accomplished using HTML forms, which permit the user to enter or select values for fields which may be text or numeric in type. There are two methods for actually sending this data: GET and POST. In the former the data is appended to the URL of the CGI program and is then put into an environment variable called, for example, QUERY_STRING. The advantage of the GET method is its relative simplicity and the fact that the query information is visible at the client side as an addendum to the URL of the CGI program. The disadvantage of this method is that the environment can run out of space when a large amount of data is sent. In the POST method, on the other hand, some information is put into environment variables, for example the number of bytes of the actual data, and then the data is sent to the CGI program as standard input. This program must pick up from the environment the information about the length of the data contents, since there is no distinguished character sent to indicate the end of the data stream, and then use the length to read the rest of the data byte by byte.
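The POST handling just described can be sketched in Python as follows. This is not the PiLLoW-based code used in our system: the function name `read_post_fields` is hypothetical, the field name `input` is invented for illustration, and `alphabet` and `correl_min` are borrowed from the parameter names that appear in the program's output. `CONTENT_LENGTH` is the standard CGI environment variable carrying the number of bytes of data.

```python
import io
from urllib.parse import parse_qs

def read_post_fields(environ, stdin):
    """Read exactly CONTENT_LENGTH bytes from stdin and decode the form fields.

    There is no end-of-data marker, so the length must come from the
    environment; reading further could block or pick up unrelated bytes.
    """
    length = int(environ.get("CONTENT_LENGTH") or 0)
    body = stdin.read(length).decode("ascii")
    # parse_qs maps each field name to a list of values; keep the first one.
    return {name: values[0] for name, values in parse_qs(body).items()}

# Simulated request. A real CGI program would pass os.environ and
# sys.stdin.buffer instead of these stand-ins.
body = b"alphabet=DNA&correl_min=5&input=tata+cctg"
fields = read_post_fields({"CONTENT_LENGTH": str(len(body))},
                          io.BytesIO(body + b"ignored"))
print(fields)
```

The trailing bytes appended to the simulated stream are never read, which shows why the length taken from the environment, rather than an end marker, delimits the data.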
The POST method is best suited to situations in which a relatively large amount of data is sent to the server, and our implementation makes use of this method since we provide a specialised query engine and users supply both a query (in the form of the required parameters) and the input string (see Figure 2). The form can be found at http://www.soi.city.ac.uk/~drg/cgi-bin/struct-form.html, and sample databases at http://www.ebi.ac.uk/srs/srsc. The form comes pre-filled with data, and when reset will be re-filled with this data. Numeric fields may be left blank or filled in; the less constrained the query is, the longer the processing will take. A disadvantage of the CGI interface is that no client-side processing is performed, and thus failures can occur if, for example, the fields are filled in incorrectly. Our program checks the types and values of input data and returns a `failed query' message to the user if any data violation occurs.

The alphabets that the program can process are:

DNA: a,c,g,t with complements a-to-t, t-to-a, c-to-g, g-to-c
RNA: a,c,g,u with complements a-to-u, c-to-g, g-to-c, g-to-u, u-to-a, u-to-g

Radio buttons are used to select the alphabet (the default is DNA). The program will accept searches for RNA structures using an alphabet of a,c,g,t, but will translate t to u and then use the RNA complementation. Output in this case uses the a,c,g,u alphabet.

Structures are of the form

    |---------|------|---------|
         A        B        C

where A and C are correlated regions of equal length and B is a `spacer'. The types of target structure that can be searched for are Stem loop (default), Repeat and Inverse, with selection by radio button. These structures are based on the following correlations: identity, giving repeat structures; reverse, giving inverse structures (palindromes); and complement, giving stem loops when combined with reverse.
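The three correlations can be made concrete with a short sketch. It is given in Python for illustration only and is not taken from our clp(FD) search engine; all names are hypothetical, the tables simply transcribe the complements listed above, and only exact (zero-mismatch) correlation is checked.

```python
# Complement tables transcribed from the alphabets listed in the text.
DNA_PAIRS = {"a": {"t"}, "t": {"a"}, "c": {"g"}, "g": {"c"}}
# In RNA, g and u may each pair with two different bases.
RNA_PAIRS = {"a": {"u"}, "c": {"g"}, "g": {"c", "u"}, "u": {"a", "g"}}
PAIR_TABLES = {"DNA": DNA_PAIRS, "RNA": RNA_PAIRS}


def to_rna(seq):
    """Translate a sequence over a,c,g,t into the a,c,g,u alphabet."""
    return seq.replace("t", "u")


def is_repeat(a, c):
    """Identity correlation: region C repeats region A."""
    return len(a) == len(c) and a == c


def is_inverse(a, c):
    """Reverse correlation: region C is region A read backwards."""
    return len(a) == len(c) and a == c[::-1]


def is_stem(a, c, alphabet="DNA"):
    """Reverse plus complement: A pairs base by base with C read backwards."""
    table = PAIR_TABLES[alphabet]
    return len(a) == len(c) and all(
        y in table.get(x, set()) for x, y in zip(a, reversed(c))
    )


print(is_stem("gtatt", "aatac"))        # True
print(is_inverse("acgt", "tgca"))       # True
print(is_stem("gggu", "gcuc", "RNA"))   # True
```

Representing the complement relation as a mapping from a base to a set of partners lets the same test serve both alphabets, since an RNA base may have more than one partner.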
Correlated regions are always of equal length (measured in nucleotides); their length may be specified within a minimum and maximum range, and a percentage mismatch range based on the Hamming distance may also be given. The length of the spacer region, measured in nucleotides, may similarly be specified within a minimum and maximum range, as can the total length of the structure. This potential redundancy in information gives the user the freedom to omit some of the parameters as they see fit. The position in the input string where the structure is to be searched for may be specified so that the structure starts and ends either at exact positions or within that range (i.e. as a window).

The input string itself can be given as one or more lines, each optionally prefaced by an integer indicating the start position. If the first line is prefaced by an integer, then that value is taken to be the initial start position of the string; otherwise the default is that the string starts at position 1. Spaces are ignored in the input string.

Output comprises a repetition of the query parameters and the input string, followed by those structures found (if any) in the form:

    Mismatch percent  start=position, end=position
    start=position, end=position,  start=position, end=position,  start=position, end=position,
    correlated region (1)          spacer region                  correlated region (2)

Figure 2: Input form

For example, the result of the query described in Figure 2 is given below:

    Structure search version $Revision: 1.55 $

    alphabet=DNA
    struct=Stem loop
    correl_min=5
    correl_max=7
    mismatch_min=0
    mismatch_max=0
    spacer_min=24
    spacer_max=26
    struct_min=$empty
    struct_max=$empty
    pos_start=100
    pos_end=400

    Input string:
      1 taattttaat caaatgaaaa aaaacaaagc ggtaatgaaa attgccgctt tttctttttg
     61 agaaatatga cagtcaaaat cttacagatc aaaacctgat aacagtattt tctcagtcta
    121 atttttgcgt attaatacaa tacgggattg cgtagataaa gtattatcaa aaaactaata
    181 attttatgaa attaaataat tttttctatt gactattaaa gaatccggag taaattagtc
    241 tccaaaatta accaaaacta ggtaatttat ccggtcaaag gttatcttaa gtattaaccc
    301 taagaaaaag gaaaacgagt atgtccagta caggatatgc tccattttat ctccgtttta
    361 ttcagttccc aagtaatgaa gttttactct atgaatactg gaaacttgtt cagaattttg
    421 tacaaaaggt tagtaaaata acggtaagat tagcacaaat cgttggcatt ctcggcgaaa
    481 aaactatttg gaaataccaa agtactttta atgatggcat gctggatatt gtggtttggt
    541 tatcttattc aaaataaatt attaacaagg agatttaata tg

    Structures found: Stemloop

    Mismatch 0% start=105, end=138
    start=105, end=109, start=110, end=133, start=134, end=138,
    gtatt ttctcagtctaatttttgcgtatt aatac

    Mismatch 0% start=169, end=204
    start=169, end=173, start=174, end=199, start=200, end=204,
    aaaaa actaataattttatgaaattaaataa ttttt

    Mismatch 0% start=170, end=205
    start=170, end=174, start=175, end=200, start=201, end=205,
    aaaaa ctaataattttatgaaattaaataat ttttt

    Mismatch 0% start=249, end=284
    start=249, end=253, start=254, end=279, start=280, end=284,
    taacc aaaactaggtaatttatccggtcaaa ggtta

    Mismatch 0% start=169, end=205
    start=169, end=174, start=175, end=199, start=200, end=205,
    aaaaaa ctaataattttatgaaattaaataa tttttt

    Mismatch 0% start=231, end=269
    start=231, end=237, start=238, end=262, start=263, end=269,
    taaatta gtctccaaaattaaccaaaactagg taattta

    No (more) found

5.4 Using the PiLLoW library

We use the PiLLoW library [7] to access the data sent by the client via the form; specifically, we use get_form_input(InputList), which translates input from the form into a dictionary Dic of attribute=value pairs. It translates empty values (which indicate only the presence of an attribute) to `$empty', values with more than one line (from text areas or files) to a list of lines as strings, and the rest to atoms or numbers. The get_form_value(Dic,Var,Val) routine is then used to obtain the value Val of the attribute Var, and we also employ the text_lines(Val,Lines) routine, which transforms a value given by a dictionary into a list of lines, for data coming from a text area.
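The translation from form data to a dictionary, and from a text-area value to a positioned input string, can be sketched as follows. This is a Python analogue written for illustration, not the PiLLoW code itself: form_to_dict, text_lines and parse_input_lines are hypothetical names that only mimic the behaviour described above, and the treatment of integer prefixes on lines other than the first (they are dropped) is our assumption.

```python
from urllib.parse import parse_qsl

EMPTY = "$empty"  # marker used for attributes that are present but blank


def form_to_dict(body):
    """Turn URL-encoded form data into a dictionary of attribute=value pairs."""
    dic = {}
    for name, value in parse_qsl(body, keep_blank_values=True):
        value = value.replace("\r\n", "\n")
        if value == "":
            dic[name] = EMPTY                 # blank field
        elif "\n" in value:
            dic[name] = value.split("\n")     # text area: list of lines
        else:
            try:
                dic[name] = int(value)        # numeric field
            except ValueError:
                dic[name] = value             # anything else stays text
    return dic


def text_lines(value):
    """Return a dictionary value as a list of lines."""
    if value == EMPTY:
        return []
    if isinstance(value, list):
        return value
    return [str(value)]


def parse_input_lines(lines):
    """Join input lines into (start_position, sequence), ignoring spaces.

    An integer prefix on the first line sets the start position (default 1);
    prefixes on later lines are simply dropped (an assumption of this sketch).
    """
    start = 1
    chunks = []
    for index, line in enumerate(lines):
        tokens = line.split()
        if tokens and tokens[0].isdigit():
            if index == 0:
                start = int(tokens[0])
            tokens = tokens[1:]
        chunks.append("".join(tokens))
    return start, "".join(chunks)


body = "alphabet=DNA&correl_min=5&struct_min=&input=1+taatt+ttaat%0D%0A61+agaaa"
dic = form_to_dict(body)
print(dic["struct_min"])                             # $empty
print(parse_input_lines(text_lines(dic["input"])))   # (1, 'taattttaatagaaa')
```

Keeping the blank-field marker distinct from a missing key lets the query evaluator tell an omitted parameter apart from a field that is not in the form at all.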
5.5 Porting the PiLLoW library to clp(FD)

The PiLLoW library has been well designed to make the task of interfacing logic programs to the Web easy. Our application required only a (significant) subset of this library, namely the routines associated with accessing the data sent to a CGI program by a client program; however, the library had not been ported to clp(FD), and this we had to do.

The first problem in making the port was that the PiLLoW libraries make extensive use of definitions written as DCGs, whereas clp(FD) does not have a DCG expander built into its clause reader. We got around this by reading in the libraries using SICStus Prolog and then listing the consulted programs to a file; unwelcome side-effects were that the comments and meaningful variable names were lost and that the code grew in size.

Secondly, the code in the original PiLLoW libraries does not completely conform to the ISO Prolog standard [2, 8], whereas clp(FD) is compliant. Specifically, we changed all occurrences of atom_chars/2 and number_chars/2 in the original code to atom_codes/2 and number_codes/2 respectively. The environment is accessed in a cleaner way in clp(FD) than in the original libraries, and thus we were able to use the following routine:

    getenvstr(Name,Content) :-
        unix(getenv(Name,String)),
        atom_codes(String,Content).

as opposed to the original

    getenvstr(Var,Val) :-
        name(Var,VS),
        append("echo $",VS,SCommand),
        name(Command,SCommand),
        unix(popen(Command,read,S)),
        get_line(S,Val),
        Val = [_|_].

Finally, the PiLLoW system and clp(FD) are both module-based but, as one might expect, employ different syntax, forcing us to modify the code accordingly.

In summary, however, the PiLLoW libraries required few modifications to make the port to clp(FD), although our task would have been easier if we had had a manual or more complete documentation for the system.
5.6 Good software engineering practice

Two variants of our system exist: one with a Web interface and the other with a simple teletype interface; both utilise the same search engine. We have made use of the module facility of clp(FD) to ensure that both variants use the same version of the search engine: the search engine code is kept separate from the teletype-interface and Web-interface routines, allowing the engine to be linked to either interface. Moreover, we use RCS, the Revision Control System, to control the generation of versions and to be able to back out of a version if it proves to be flawed. We have taken advantage of the ability of RCS to number versions automatically and to insert this information into the source code, thus enabling users to tell us which version of the program they are using when they give feedback.

5.7 Testing

Our search engine source program is 388 lines (10K) of clp(FD) code; we have compiled our program to 370K of stand-alone sun-sparc code using the clp(FD) system [9], and have used this to test the detection of stem-loops from a variety of databases, including the entry with ID CXSTPLUC2 (accession number X87994) from the EMBL nucleotide sequence database release 49 (Nov 1996), URL: http://www2.no.embnet.org/srs/srsc?[EMBL-id:CXSTPLUC2]+-sf+GCG. For example, on a Sun IPX our program took 40 ms to find the stem-loop cccgtcca, gctcggct, tggacggg at position 20-43 (perfect matching), and 90 ms to find the stem-loop cagctcg, gcttgga, cgggctg at position 26-46 (mismatch of 14%), in a string of nucleotides from positions 1-60. More complete test results can be found in [10].

Readers can access the Web version of our program at http://www.soi.city.ac.uk/~drg/cgi-bin/struct-form.html

6 Summary

We have described an implementation, in the finite domain constraint logic programming language clp(FD), of a web-based biosequence structure search engine.
The engine is based on a declarative language with constraints over distances (between strings, in terms of nucleotides) and relations over a finite alphabet of nucleotides. Users specify the parameters of the structure which they wish to search for, and also provide an input string over which the search is to be carried out. The search engine constructs a schema, or generalised structure, which is then matched against the input string, and the instances found are returned. These structures range from strings and regular expressions to more complex structures such as palindromes, repeats, stem loops and pseudo-knots.

Limitations of the present implementation include the relatively small datasets which can be handled efficiently using the POST method; we plan that in the future users will be able to supply the URL of a biosequence database and that our implementation will then retrieve the input string itself using PiLLoW routines. The search engine uses a naive backtracking algorithm for matching, but despite its inefficiencies we have tested our implementation on some real biological sequences with encouraging results. We are now in the process of making an object-oriented design for a CSP-based solver and implementing it in C++; we intend to interface this solver to the high-level implementation which has been made using constraint logic programming. A challenging task for the future will be to extend our program to search for structures in protein databases and to interface it to a plug-in visualiser such as Chime [1].

Acknowledgements

We wish to thank Daniel Diaz, author of the clp(FD) package, for his help with designing some of the routines needed by our solver, and Manuel Hermenegildo and the other authors of the excellent PiLLoW package. This work has been carried out as part of a project financed by the British Council and the Norwegian Research Council, which provided funding for the research visits. In addition, Inge Jonassen's research post is financed by the Norwegian Research Council.
References

[1] Chemscape Chime 1.0. http://www.mdli.com/chemscape/chime/. Netscape Navigator plug-in.
[2] ISO/IEC 13211-1, Information Technology - Programming Languages - Prolog - Part 1: General Core, 1995.
[3] A. Bairoch, P. Bucher, and K. Hofman. The PROSITE database, its status in 1995. Nucleic Acids Research, 24(1):189-196, 1996.
[4] L. Baranyi, W. Campell, K. Ohshima, S. Fujimoto, M. Boros, and H. Okada. The antisense homology box: A new motif within proteins that encodes biologically active peptides. Nature Medicine, 1(9):894-901, 1995.
[5] A. Brazma, I. Jonassen, I. Eidhammer, and D. Gilbert. Approaches to the automatic discovery of patterns in biosequences. Technical Report TCU/CS/1995/18, Department of Computer Science, City University, 1995. Also Technical Report 113, Department of Informatics, University of Bergen, Bergen, Norway.
[6] Steven E. Brenner and Edwin Aoki. Introduction to CGI/PERL. M&T Books, 1996.
[7] D. Cabeza, M. Hermenegildo, and S. Varma. The PiLLoW/CIAO Library for INTERNET/WWW Programming using Computational Logic Systems. In Proceedings of the 1st Workshop on Logic Programming Tools for INTERNET Applications, pages 72-90, JICSLP'96, Bonn, September 1996. Text and code available from http://www.clip.dia.fi.upm.es/miscdocs/pillow/pillow.html.
[8] P. Deransart, A. Ed-Dbali, and L. Cervoni. Prolog: The Standard. Springer, 1996.
[9] D. Diaz and P. Codognet. A Minimal Extension of the WAM for clp(FD). In David S. Warren, editor, Proceedings of the Tenth International Conference on Logic Programming, pages 774-790, Budapest, Hungary, 1993. The MIT Press.
[10] I. Eidhammer, D. Gilbert, I. Jonassen, and M. Ratnayake. A constraint based structure description language for biosequences. Technical Report 1997/04, Department of Computer Science, City University, UK and Department of Informatics, University of Bergen, Norway, 1997.
[11] Ingvar Eidhammer, David Gilbert, Inge Jonassen, and Madu Ratnayake. A constraint based structure description language for biosequences. Submitted to CP97, 1997.
[12] C. Gervet. Conjunto: constraint logic programming with finite set domains. In Maurice Bruynooghe, editor, Logic Programming - Proceedings of the 1994 International Symposium, pages 339-358, Massachusetts Institute of Technology, 1994. The MIT Press.
[13] R. Hamming. Coding and Information Theory. Prentice Hall, Englewood Cliffs, NJ, 1982.
[14] P. V. Hentenryck and Y. Deville. Operational semantics of constraint logic programming over finite domains. In J. Maluszynski and M. Wirsing, editors, PLILP91, number 528 in LNCS, pages 395-406. Springer-Verlag, Aug 1991.
[15] V.I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Doklady Akademii nauk SSSR (in Russian), 163(4):845-848, 1965. Also in Cybernetics and Control Theory, vol 10, no. 8, pp 707-710, 1996.
[16] A. Rajasekar. Applications in constraint logic programming with strings. In Alan Borning, editor, PPCP'94: Second Workshop on Principles and Practice of Constraint Programming, Seattle WA, May 1994.
[17] Y. Sakakibara, M. Brown, R. Hughey, I.S. Mian, K. Sjoelander, R. Underwood, and D. Haussler. Stochastic context-free grammars for tRNA modelling. Nucleic Acids Res, 22:5112-5120, 1994.
[18] D. Searls. The Computational Linguistics of Biological Sequences. Tutorial at Third International Conference on Intelligent Systems for Molecular Biology, 1995.
[19] C. Walinsky. CLP(Σ*): Constraint logic programming with regular sets. In Giorgio Levi and Maurizio Martelli, editors, ICLP'89: Proceedings 6th International Conference on Logic Programming, pages 181-196, Lisbon, Portugal, June 1989. MIT Press.