StructWeb: Biosequence structure searching on the Web using clp(FD)

Computer Science
Technical Reports
Technical Report No. 1997/05
StructWeb: Biosequence structure
searching on the Web using clp(FD)
David Gilbert, Ingvar Eidhammer and Inge Jonassen
May 1997
ISSN 1364-4009
City University
Dept. of Computer Science
Northampton Square
London EC1V 0HB
United Kingdom
Ingvar Eidhammer
Department of Informatics
University of Bergen
HIB
N-5020 Bergen Norway
[email protected]
Inge Jonassen
Department of Informatics
University of Bergen
HIB
N-5020 Bergen Norway
[email protected]
David Gilbert
Department of Computer Science
Northampton Square,
London EC1V 0HB,
United Kingdom,
email: [email protected]
Abstract
We describe an implementation, in a finite domain constraint logic programming language, of a web-based biosequence structure searching program. We have used the clp(FD)
language for the implementation of our search engine and have ported the PiLLoW libraries to clp(FD). Our program is based on CBSDL, a constraint based structure description language for biosequences, and uses constrained descriptions to search for the
secondary structure of biosequences, such as tandem repeats, stem loops, palindromes and
pseudo-knots.
Keywords: constraints, biostructures, description language, searching, WWW interface.
1 Introduction
In this paper we report an implementation of a web-based biosequence structure searching
program. This implementation has been constructed using the finite domain constraint logic
programming language clp(FD) [9] together with our own port of the PiLLoW library [7] to
clp(FD). Our search engine is based on CBSDL, a constraint based structure description
language for biosequences described in [11], and uses constrained descriptions to search for the
secondary structure of biosequences, such as tandem repeats, stem loops, palindromes and
pseudo-knots.
In this paper we describe the background to our structure searching language, the implementation of the search engine, issues encountered in porting the PiLLoW libraries to clp(FD) and
the Web interface that we have constructed for our search engine.
2 Biosequences and structures
2.1 Biological motivation
Biological macromolecules (DNAs, RNAs and proteins) are chains of relatively small organic
molecules. The different types of these organic molecules are few: there are 4 different bases
for DNAs and RNAs and 20 different amino-acids for proteins. Conventionally, the alphabets
for DNA/RNA are written in lower case and the protein alphabet in upper case. A macromolecule can be coded as
a string over an alphabet of size 4 (for DNA/RNA), or 20 (for proteins) starting from one
end of the chain and moving towards the other. The strings for DNA/RNA molecules are
called nucleotide sequences, and each element in such a sequence is called a base. The strings
for protein molecules are called protein sequences, and each element in such a sequence is an
amino-acid (residue). Collectively, nucleotide and protein sequences are called bio-sequences,
or just sequences. Sometimes we will also refer to them simply as strings.
Watson and Crick discovered in 1953 that DNA forms a double helix where a base in one strand
is bonded to a complementary base in the other strand (chain), and the so-called Watson-Crick
base pairs are a-t and g-c. The bases in RNA molecules can form bonds a-u and g-c in a similar
way. RNA and protein molecules fold into 3 dimensional structures enabling them to perform
their structural/functional role in the cell. The structures can be described at different levels.
For RNA molecules, the secondary structure is the collection of base pairs which are formed
in the folded molecule, and the tertiary structure is the complete 3 dimensional structure of
the folded molecule.
An important problem in molecular biology is the prediction of the biological properties of a
macromolecule from its sequence, in particular the prediction of the structure and function of
an RNA molecule or a protein from its sequence. Proteins may be grouped into families where
the members of a family have similar structures. If the structure of one family member is
known, this helps in finding the structure of the other proteins in the family. Features that are
common to the sequences of the proteins in a family can be expressed in a pattern, and a new
sequence can be hypothesised to belong to the family if it fits the pattern. Most languages used
to define patterns for protein sequences permit only the definition of what we call sequential
patterns. Sequential patterns give sufficient expressive power to describe sequence features
that are characteristic for many protein families. This is illustrated by the PROSITE protein
family database which gives descriptive patterns for most of its families [3].
For describing patterns in RNA sequences, one needs to include dependencies between individual letters because the base-pairing interactions (most importantly a-u, g-c and g-u) play
a dominant role in determining RNA structure and function [17]. Figure 1 (i) and (ii) show
two stem-loops (defined later) that might be structurally and functionally equivalent in RNA
molecules. It does not matter which bases (symbols) are present in the sequence in order for a
stem-loop to be formed, as long as the sequence contains two substrings of some minimum and
identical length which are the reverse complement of each other. We will call such patterns
of dependencies structures in the sequences, and note that such patterns can be described using structural patterns as defined in Section 2.3 below. We can also describe other structures
found in RNA and DNA molecules such as clover-leafs and pseudo-knots. Finding a match to
a structural pattern in an RNA sequence does not imply that the corresponding molecule in
its native folded state will have the base pairing described by the pattern. It is believed that
the native structure will be one with minimum free energy, and another set of base pairings
than the one described by the pattern, might give a lower free energy.
       c               g               x               x
      a-u             u-a             o-o             o-o
      g-c             a-u             o-o             o-o
      u-a             g-c             o-o             o-o
      c-g             c-g             o-o             c-o
  augc   ggcau    aggc   ccgu       xx   xx         xx   xx
      (i)            (ii)            (iii)           (iv)
Figure 1: Illustration of structures and structural patterns:
(i) and (ii) show two examples of
structures (stem loops) that might be equivalent in RNA molecules. Watson-Crick base pairing is
between a and u, and between g and c. Other base pairings are also possible. Figure (iii) shows a
possible representation of a pure structural pattern matching the structures (i) and (ii). The pattern
can also be called a consensus for the structures in (i) and (ii). The o and x symbols each match
any one nucleotide symbol, and pairs of o symbols which are connected with a dash (-) should match
pairs of symbols that can base pair. The x symbols are wildcards - one x matches any one symbol.
Figure (iv) shows a structural pattern equivalent to the pattern shown in (iii) except that the first
nucleotide in the first part of the stem has to be a c: a restriction on the content of the substrings that
match the pattern.
Structural patterns should not be used alone to predict the secondary structure of RNA,
but can be used in conjunction with structure prediction methods to provide hypotheses of
possible folds. This can be done efficiently because matching a string against a structural
pattern is computationally cheap compared to structure prediction. Another advantage of
using structural patterns is that they can be used to describe complex structures which
are not allowed when using dynamic programming based structure prediction. We postulate
that algorithms can be developed for finding conserved structural patterns in a set of RNA
sequences analogous to algorithms for finding conserved sequential patterns in sets of protein
sequences [5], and will investigate this in further work. In this way structural patterns allow
for description of potentially interesting conserved structures in sets of related biosequences.
Structures are also found in DNA sequences that can be described using structural patterns
but not using sequential patterns. This includes structures such as repeats and palindromes.
Repeats are abundant in genomic DNA, both in coding and in non-coding areas, and for
instance recognition sites for restriction enzymes are often palindromes.
2.2 Example structures
In the structure descriptions below, α and β (with or without indices) are pattern components and
x is a wildcard (matching any one letter in an input string); α^r is the reverse of α, α^c is
the complement of α, and α^rc is the reverse complement of α. We have identified the following
structures in the literature, see for example [18, 4]. For each type we give one example. All
examples are from DNA/RNA sequences, except for the last which is from a protein sequence.

Structure            Pattern                        Example
Tandem repeat        α α                            acgacg
Simple repeat        α β α                          acgaaacg
Multiple repeat      α β α β1 α                     acgaaacguuacg
Stem loop            α β α^rc                       acgaacgu
Attenuator           α β α^rc β1 α                  acgaacguauacg
Palindrome, even     α α^r                          acggca
Palindrome, odd      α x α^r                        acgagca
Pseudoknot           α1 β α2 β1 α1^rc β2 α2^rc      acgaaucugccguauaaga
Sense - antisense    α ... α^c                      IVLSPANHK
More complicated structures can be obtained by combining the ones above, e.g. clover-leafs.
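The example strings listed above can be checked mechanically. The following Python sketch is not part of StructWeb: the helper names are hypothetical, and it assumes Watson-Crick pairing only (a-u, g-c) over the RNA alphabet. It rebuilds several of the example strings from the components acg and ucu.

```python
# Illustrative sketch only; helper names are assumptions, not StructWeb code.
# Watson-Crick pairs for RNA (no g-u wobble pairing here).
COMPLEMENT = {"a": "u", "u": "a", "g": "c", "c": "g"}


def reverse(s):
    """Return the reverse of s."""
    return s[::-1]


def complement(s):
    """Return the symbol-wise complement of s."""
    return "".join(COMPLEMENT[ch] for ch in s)


def reverse_complement(s):
    """Return the reverse complement of s."""
    return complement(reverse(s))


alpha1, alpha2 = "acg", "ucu"

tandem_repeat = alpha1 + alpha1
stem_loop = alpha1 + "aa" + reverse_complement(alpha1)
attenuator = alpha1 + "aa" + reverse_complement(alpha1) + "au" + alpha1
even_palindrome = alpha1 + reverse(alpha1)
odd_palindrome = alpha1 + "a" + reverse(alpha1)
pseudoknot = (alpha1 + "aa" + alpha2 + "gc"
              + reverse_complement(alpha1) + "aua"
              + reverse_complement(alpha2))

for name, value in [("tandem repeat", tandem_repeat),
                    ("stem loop", stem_loop),
                    ("attenuator", attenuator),
                    ("palindrome, even", even_palindrome),
                    ("palindrome, odd", odd_palindrome),
                    ("pseudoknot", pseudoknot)]:
    print(f"{name:18s} {value}")
```

Building the strings from their components, rather than parsing them, keeps the sketch short; searching for such components in an arbitrary input string is the harder problem that the constraint formulation addresses.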
2.3 Constrained patterns in biosequences
The goal of the research underpinning the development of our search engine is to investigate
how constraint solving techniques can be used to search for structural patterns in sequences
(or strings) of symbols over a finite alphabet Σ. The main motivation is searching in biological
sequences, and also providing high-level descriptions of biosequence database contents, but
we believe that programs for searching for such patterns might also be useful in other areas,
e.g. signal processing or the treatment of acoustic data.
We define a pattern as consisting of a logical expression on components and a set of unary
and binary constraints on the components, where a component is a description of a string of
symbols. An input string S matches a pattern if for each component it contains a substring
matching that component, such that all the constraints are satisfied.
A pattern can contain constraints of the following five types, on the:
1. length of a substring to match a specific component,
2. distance (in the input string) between substrings to match the different components of
a pattern,
3. contents of a substring to match a component, e.g. the second symbol should be an a or
a t,
4. positions on the input string where a particular component can match,
5. correlation between two substrings matching different components, e.g. the substrings
should be identical, or the reverse of each other.
We also define three associated classes of patterns:
Sequential: patterns which do not include a correlation constraint. The patterns in the
PROSITE data base [3] are examples of this class, for example [AC]-x(2,3)-D describing
a pattern comprising three components, the first being an A or a C, the second of length
2 or 3 and the last consisting of a D.
Pure structural: patterns including at least one correlation constraint and no content
constraints. One example is repetition, where the substrings matching two different
components must be identical. Another example is a palindrome, where two consecutive substrings of equal length must be the reverse of each other.
Structural: patterns having at least one correlation constraint and one content constraint.
One example is a palindrome beginning with an a.
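Because sequential patterns contain no correlation constraint, they can be matched with an ordinary regular expression engine. The following Python sketch translates a small subset of PROSITE-style syntax (single residues, bracketed sets, the wildcard x and repetition counts) into a Python regular expression. The function name is hypothetical and the subset is an assumption made for illustration; it is not the full PROSITE syntax.

```python
import re

# One pattern element: a residue, a bracketed set or the wildcard x,
# optionally followed by a repetition count such as (2) or (2,3).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a small subset of PROSITE-style syntax into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        symbol, low, high = m.groups()
        atom = "." if symbol == "x" else symbol
        if low is None:
            parts.append(atom)                       # exactly one symbol
        elif high is None:
            parts.append(f"{atom}{{{low}}}")         # fixed repetition
        else:
            parts.append(f"{atom}{{{low},{high}}}")  # repetition range
    return "".join(parts)


regex = prosite_to_regex("[AC]-x(2,3)-D")
print(regex)
print(re.search(regex, "MKAGGDL"))
```

No such translation exists for pure structural or structural patterns, since a regular expression cannot require two substrings of arbitrary content to be, for example, reverse complements of each other.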
3 The structure language
A pattern in our language is defined by a structure specification, which is of the form
S, c1, ..., cn
where S is a string expression and c1, ..., cn is a set of constraints. The string expression
specifies the components, or string variables (denoted by the Greek letters α, β, γ, ... and
possibly subscripted), taking part in the pattern, and a logical expression on them using
conjunction, disjunction and negation. We follow the convention that
A set of constraints can contain constraints of the five types: length, distance, content, position and correlation constraints, which are described below. In addition, we permit equality
and inequality operations over the integer components of the constraints, with the usual arithmetic operations over integers: addition, subtraction, multiplication and integer division. We
further allow the user to describe complex structures by conjoining structure descriptions.
A length constraint restricts the length of a string variable to be within a particular range,
and has the form length(α, L) where α is a string variable and L ranges over the positive
integers such that the length of α is constrained to be within the range of L. We permit the
length of a string variable to be 0 in order to be able to describe null-strings.
Furthermore, we introduce two variants, maxlength(α, L) and minlength(α, L), such that the
length of α is the maximum, respectively minimum, value possible within the range denoted
by L according to some mapping to a given input string. Redundant matches are thus avoided in
the case of e.g. stem loops, where substrings of the stem are not required.
A distance constraint restricts the distance between two string variables. Distance constraints are specified
in a declarative and uniform way, e.g. start_start(α, β, D), end_start(α, β, D),
start_end(α, β, D), end_end(α, β, D) where α and β are string variables and D ranges over
the integers. These relations constrain the distance between the start of α and start of β (respectively end of α and start of β, start of α and end of β, end of α and end of β) to lie
within the range denoted by D. A negative value for D indicates that the point of reference of
α occurs after the corresponding point of reference of β in the input string. We also permit
the shorthand α:β to indicate that β starts directly after α. This shorthand is equivalent to
α ∧ β, end_start(α, β, 1).
A content constraint restricts which symbols can be in a specific position on a string variable
matching a component and is expressed thus: content(α, Pos, Set) where α is a string variable,
Pos is a positive or negative (non-zero) integer representing α_Pos, the character of α at
position Pos from the start (or end if Pos is negative) of α, and Set is a (non-empty) set of
characters to which α_Pos may be bound, e.g. {a,t}.
A position constraint restricts the absolute positions of a string variable on the input string
and is expressed as start(α, P) or end(α, P) where α is a string variable and P ranges over the
positive integers such that the first (respectively last) character of α is located at position P
on the input string.
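The length, distance and content constraints can be read as simple checks on a candidate mapping. The following Python sketch assumes that each string variable has already been mapped to a span (start, end) of 1-based, inclusive positions on the input string; this representation and the function names mirror the constraint names in the text but are our own illustration, not the clp(FD) implementation, which posts these relations as constraints rather than testing them after the fact.

```python
def length(span):
    """Number of symbols covered by a span (start, end), 1-based inclusive."""
    start, end = span
    return end - start + 1


def start_start(a, b):
    """Distance from the start of a to the start of b."""
    return b[0] - a[0]


def end_start(a, b):
    """Distance from the end of a to the start of b (1: b directly follows a)."""
    return b[0] - a[1]


def content(seq, span, pos, allowed):
    """True if the symbol at offset pos of the span is in the set allowed.

    Positive pos counts from the start of the span, negative pos from its end.
    """
    if pos == 0:
        raise ValueError("pos must be non-zero")
    start, end = span
    index = start + pos - 1 if pos > 0 else end + pos + 1
    return seq[index - 1] in allowed


# Hypothetical mapping of three adjacent components onto an input string.
seq = "tatacctgtcaggtata"
alpha, beta, gamma = (5, 8), (9, 9), (10, 13)

print(length(alpha), end_start(alpha, beta), end_start(beta, gamma))
print(content(seq, alpha, 1, {"c"}), content(seq, alpha, -1, {"g"}))
```

With this convention, a negative distance such as end_start(gamma, alpha) simply records that the reference point of the second argument lies before that of the first.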
A correlation constraint ("correlation" for short) defines the relation between the contents
of two string variables. A correlation C has the following properties:
It relates two string variables, $C(\alpha, \beta)$, the string variable $\alpha$ being called the source, and $\beta$ the target.
The length of the two string variables must be equal (due to equal numbers of symbols in
the matching substrings), implying that there is an implicit length constraint between the two
strings.
There is a direction-component $C_d$, written as the relation $C_d(\alpha, \beta)$. The two legal values
for $C_d$ are 1 and -1. $1(\alpha, \beta)$ is satisfied iff $(\forall i : 1 \le i \le h : \alpha_i \text{ is related to } \beta_i)$. $-1(\alpha, \beta)$ is
satisfied iff $(\forall i : 1 \le i \le h : \alpha_i \text{ is related to } \beta_{h-i+1})$, where $\alpha_i$ and $\beta_i$ are symbols from $\alpha$ and
$\beta$, and $h$ is the length of the matching substrings. Note that this means that all positions of
the string variables take part in the correlation.
There is a symbol-component $C_s$. As part of this component a function $C_f$ is defined from
$\Sigma$ to $2^{\Sigma}$. $C_s(\alpha, \beta)$ is satisfied iff $(\forall i : 1 \le i \le h : \beta_i \in C_f(\alpha_i))$.
Let $L$ be the language of all strings with symbols from $\Sigma$. The correlation $C(\alpha, \beta)$ is satisfied
iff $\exists x : x \in L : C_d(\alpha, x) \land C_s(x, \beta)$.
Furthermore, we define a notion of approximate matching, given as an argument to the appropriate correlation constraints. This argument ranges over the interval 0..100 and represents
the percentage mismatch between two string variables; when the mismatch is zero then we
can omit this argument. We can use Hamming distance [13], edit distance or more generally
Levenshtein distance [15] in order to implement approximate matching (see footnote 1).
We define id(α, β) and reverse(α, β) as general correlation constraints over all alphabets, where
β is the identity (respectively, reverse) of α, and assume that there is a library of correlations,
and that a user may add a new correlation to the library, use a known correlation, or use
(without storing in the library) an unnamed correlation in a specification.
A correlation is thus defined by two arguments, the direction and the symbol component. For
example the definition of the reverse complement for the DNA-alphabet is
rev_compl_DNA(-1, {A → {T}, C → {G}, G → {C}, T → {A}}). Pre-defined correlations
might be in a library.
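The two-argument view of a correlation (a direction and a symbol function) translates directly into a small checking routine. The Python sketch below is an illustration under our own assumptions: the function name correlated and the tuple representation of a correlation are hypothetical, and approximate matching is realised as a Hamming-style count of mismatching positions compared with the permitted percentage.

```python
def correlated(source, target, direction, symbol_map, mismatch_percent=0):
    """Check a correlation between two equal-length strings.

    direction is 1 (position i relates to i) or -1 (position i relates to
    h - i + 1); symbol_map sends each symbol to the set of symbols allowed
    opposite it. mismatch_percent is the tolerated share of mismatches.
    """
    if len(source) != len(target):
        return False                      # implicit length constraint
    x = source if direction == 1 else source[::-1]
    mismatches = sum(1 for xs, t in zip(x, target) if t not in symbol_map[xs])
    return mismatches * 100 <= mismatch_percent * len(source)


# A correlation is a pair (direction, symbol function), as in the text.
REV_COMPL_DNA = (-1, {"A": {"T"}, "C": {"G"}, "G": {"C"}, "T": {"A"}})
ID_DNA = (1, {s: {s} for s in "ACGT"})

print(correlated("CCTG", "CAGG", *REV_COMPL_DNA))
print(correlated("CCTG", "CAGT", *REV_COMPL_DNA))
print(correlated("CCTG", "CAGT", *REV_COMPL_DNA, mismatch_percent=25))
```

Reversing the source first and then applying the symbol function position by position corresponds to the intermediate string x in the definition of a satisfied correlation.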
Example structure descriptions
A description, or structure specification, of the stem loop using exact matching in Figure 1(iv)
is
α:β:γ, maxlength(α, 4), length(β, 1), content(α, 1, {c}), rev_compl_RNA(α, γ)
assuming a library definition of rev_compl_RNA as above, and where α and γ form the stem,
with β the loop. A longer version without using the shorthand α:β:γ would be
α ∧ β ∧ γ, maxlength(α, 4), length(β, 1), end_start(α, β, 1), end_start(β, γ, 1), content(α, 1,
{c}), rev_compl_RNA(α, γ, 0)
User queries can be formulated in this "raw form" where an input string is appended to a
structure description and some mapping algorithm is used to map the description to the string.
Thus the user may enter the following query:
α:β:γ, maxlength(α, 4), length(β, 1), content(α, 1, {c}), rev_compl_DNA(α, γ), tatacctgtcaggtata
which will result in α being mapped to the substring cctg starting at position 5 and ending
at 8, γ to cagg starting at 10 and ending at 13, and β to t at position 9. Queries may be
optionally prefaced by a description of the alphabet of characters which are permitted in the
input string.
Footnote 1: Minimum transformation costs are calculated for: Hamming distance, substitution only; edit distance, insertion
and deletion only; Levenshtein distance, substitution, deletion and insertion.
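The mapping reported for the query on tatacctgtcaggtata can be reproduced by exhaustive search. The Python sketch below is a brute-force stand-in for the constraint solver, not the StructWeb engine: the function name is hypothetical, the stem and loop lengths are fixed (we read the stem bound of 4 as an exact length, which is an assumption), and only DNA Watson-Crick complements are used.

```python
DNA_COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}


def find_stem_loops(seq, stem_len, loop_len, first_symbols=None):
    """Yield (stem1, loop, stem2) spans, 1-based inclusive, of exact stem loops.

    stem_len must be at least 1. first_symbols, if given, is the set of
    symbols allowed at the first position of the first stem.
    """
    total = 2 * stem_len + loop_len
    for start in range(len(seq) - total + 1):
        stem1 = seq[start:start + stem_len]
        loop_start = start + stem_len
        stem2_start = loop_start + loop_len
        stem2 = seq[stem2_start:stem2_start + stem_len]
        if first_symbols is not None and stem1[0] not in first_symbols:
            continue                                  # content constraint fails
        rev_compl = "".join(DNA_COMPLEMENT[ch] for ch in reversed(stem1))
        if stem2 == rev_compl:                        # correlation constraint
            yield ((start + 1, start + stem_len),
                   (loop_start + 1, loop_start + loop_len),
                   (stem2_start + 1, stem2_start + stem_len))


for stem1, loop, stem2 in find_stem_loops("tatacctgtcaggtata", 4, 1, {"c"}):
    print("stem", stem1, "loop", loop, "stem", stem2)
```

The generator enumerates every candidate start position, which is adequate for short inputs; a constraint formulation instead prunes candidates as soon as any single constraint fails.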
In order to improve the usability of the language we have defined a macro facility permitting
the user to store and re-use definitions of, i.e. grammars for, specific structures. The syntax of
this macro language is similar to that of logic programs; for example the following grammars
define languages for stem loops and pseudo knots:
stemloop(α, β, γ) :- α:β:γ, rev_compl_RNA(α, γ).
pseudoknot(α, β, γ, δ) :-
    α:ω1:β:ω2:γ:ω3:δ, rev_compl_RNA(α, γ), rev_compl_RNA(β, δ).
Such descriptions can be parameterised by the lengths of the components of the structure, for
example
stemloop(α, β, γ, StemLength, LoopLength, Mismatch) :- α:β:γ,
    maxlength(α, StemLength),
    length(β, LoopLength),
    maxlength(γ, StemLength),
    rev_compl_RNA(α, γ, Mismatch).
Note that in this description we are interested in finding the stem loops with the maximal
possible length of the stem in order to avoid matching onto many substructures of that
stemloop. We are not, however, interested in defining maxlength over the loop, since if we did
so then we would possibly omit several different structures. For example, given the sequence
ugcucaaaagagcuaaagagcu
1234567890123456789012
and an attempt to match with a definition
stemloop(α, β, γ, StemLength, LoopLength, 0), 3 ≤ StemLength, StemLength ≤ 7,
1 ≤ LoopLength, LoopLength ≤ 20
two different stem loops should be identified, i.e.
(1) gcucaaaagagc starting at 2 and ending at 13 where α = gcuc, β = aaaa, γ = gagc
(2) gcucaaaagagcuaaagagc, starting at 2 and ending at 21 where α = gcuc, β = aaaagagcuaaa,
γ = gagc
4 Implementation of the search engine using constraint logic programming
4.1 Language representation in clp(FD)
We represent components, which we term here string variables, by sequences with maximum
length m of string-characters. These comprise pairs whose first element Chars is a set of characters drawn from some alphabet A (of bases or nucleotides) and whose second element Pos is
a set of integers in 1...m. Each pair represents the possible values of the characters to be found
on the input string at the locations indicated by the second element of the pair. Moreover we
assume that the successor relation holds between the second elements of neighbouring members of the sequence in the normally accepted direction of ordering. A (suitably constrained)
string variable is thus a schema for a structure, and can be instantiated by matching against an
input string (see below).
We have chosen constraint logic programming over finite domains [14] as a paradigm for
implementation because of the declarative nature of our structure language and the use which
it makes of finite domain constraints. In our implementation sequences are represented as lists,
and thus string variables comprise lists whose elements are pairs of (Chars,Pos). We choose
also to map alphabets onto (dense subsets of) natural numbers, so that for example for DNA
we represent a, c, g, t by 1, 2, 3 and 4 respectively. In this way we can use any finite domain constraint
logic programming language, even one which does not permit operations over arbitrary finite domains.
We have used clp(FD) [9] as the basis for our implementation because it has a specialised
operation for complementation over genomic alphabets (see below). Moreover, the clp(FD)
system is freely available, small in size and can compile to executable code. Ideally we would
also like to be able to use a string solver, along the lines of [19], [12] or [16].
Length constraints are defined in the usual backtracking manner for lists although ideally we
would like to use a list solver (for example [16]). Distance constraints are defined simply by
referring to the position elements of character pairs. Content constraints are implemented by
imposing constraints on the integer sets representing the characters, using the sparse representation of finite domain variables in clp(FD) to describe non-continuous domains. Position
constraints are straightforwardly implemented by constraining the position element of a string-character pair.
General correlation constraints (those independent of any alphabet) are coded in clp(FD) as
follows.
The id constraint constrains the corresponding characters in the string-character pairs to
be equal. Note that the position elements in each corresponding pair are not constrained
by this relation, since the string variables may be mapped to different places on the input
string.
The reverse constraint first of all reverses one of the string variables and then constrains
it to be identical to the other string variable.
Approximate matching between string variables is implemented using Hamming distance and
relating this to the length of the list representing the string variable.
Complementation constraints are implemented using a specialised solving routine compl/4 in
clp(FD). For example RNA, whose alphabet a, c, g and u we represent by 1, 2, 3 and 4 respectively, has complements {a→{u}, c→{g}, g→{c,u}, u→{a,g}}. We represent this by
complement_char(Char1,Char2) :-
    compl(Char1,1,Char2,[4]), compl(Char1,2,Char2,[3]),
    compl(Char1,3,Char2,[2,4]), compl(Char1,4,Char2,[1,3]).
where the definition of compl/4 is
% if A equals Char (Val1 = 1) then B must lie in Chars (Val2 = 1)
compl(A, Char, B, Chars) :-
    A=Char <=> Val1, B in Chars <=> Val2,
    Val1 in 0 .. max(Val2), Val2 in min(Val1) .. 1.
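The net effect of the complementation constraint on two finite domains can be illustrated outside clp(FD). The Python sketch below prunes the domains of two character variables so that every remaining value has at least one permitted partner. It is a simple support-checking pass written for illustration, with a hypothetical function name; it does not reproduce the reified propagation scheme that clp(FD) uses for compl/4.

```python
# Numeric RNA alphabet as in the text: a=1, c=2, g=3, u=4.
RNA_COMPLEMENTS = {1: {4}, 2: {3}, 3: {2, 4}, 4: {1, 3}}


def propagate_complement(dom1, dom2, table=RNA_COMPLEMENTS):
    """Prune two finite domains so every remaining value has a partner.

    dom1 holds the candidate codes of the first character, dom2 those of
    its complement. An empty result signals that no solution exists.
    """
    # Keep a value of the first character only if some complement is possible.
    new1 = {a for a in dom1 if table[a] & dom2}
    # Keep a value of the second character only if some kept value allows it.
    new2 = {b for b in dom2 if any(b in table[a] for a in new1)}
    return new1, new2


print(propagate_complement({3}, {1, 2, 3, 4}))   # g restricts partner to c or u
print(propagate_complement({1, 2}, {3}))         # only c can pair with g
print(propagate_complement({1}, {1}))            # a cannot pair with a
```

Because the pairing table is symmetric, one pass in each direction leaves both domains fully supported.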
4.2 The search engine
The function of a processor for our language is to match a structure description onto an input
string, in order to determine the contents and locations of those substrings of the input string
which match the components of the description. Thus a solution to a mapping of a string
expression onto an input string is a valuation (an assignment to each constraint variable in the
string expression of one value from the domain of the variable) such that all the constraints
are satisfied. Each element of all string-character pairs must be a singleton set satisfying the
constraints on that element; an empty set indicates a failure to produce a solution. In our
problem domain we are interested in producing all the solutions (mappings) possible of a given
string expression onto an input string.
An input string I comprises a sequence of characters drawn from some alphabet A (of bases or
nucleotides); we limit the maximum length of any string to be less than or equal to some maximum
integer m. In order to perform mapping we first convert the input string into a string variable,
i.e. a list whose elements are pairs of (Chars,Pos). For example the DNA sequence of bases agt
is converted to the list [({1},{1}),({3},{2}),({4},{3})] using our numeric representation
of the base alphabet.
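The conversion of an input string into a list of (Chars,Pos) pairs is straightforward to sketch. The Python version below uses sets for both elements of each pair; the function make_stringvar here is our own illustration of the idea and is not the clp(FD) predicate of the same name, and free_stringvar is a hypothetical helper showing what an unconstrained string variable would look like in the same representation.

```python
DNA_CODES = {"a": 1, "c": 2, "g": 3, "t": 4}


def make_stringvar(sequence, codes=DNA_CODES):
    """Convert an input string into a list of (Chars, Pos) singleton pairs."""
    return [({codes[ch]}, {pos}) for pos, ch in enumerate(sequence, start=1)]


def free_stringvar(length, max_pos, codes=DNA_CODES):
    """An unconstrained string variable: any character at any position."""
    all_chars = set(codes.values())
    all_positions = set(range(1, max_pos + 1))
    return [(set(all_chars), set(all_positions)) for _ in range(length)]


print(make_stringvar("agt"))
print(free_stringvar(2, 3))
```

Matching then amounts to narrowing the sets of a free string variable until every pair is a singleton that agrees with some pair of the converted input string.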
We have defined a naive procedure to map a specification Spec (i.e. a constrained string
expression SE) onto an input string I using backtracking. We assume two types of correlation:
c (normal correlation) and r (reverse correlation), and a function p1: x × y → x.
for each pair of string variables (α, β) in SE correlated by correlation c do
    find members of I s.t. α_1 = I_j and β_1 = I_k and set i = 1
    while c(p1(α_i), p1(β_i)) and i ≤ length(α) do
        i := i + 1 and j := j + 1 and k := k + 1
        α_i = I_j and β_i = I_k
    end
end
for each pair of string variables (α, β) in SE correlated by correlation r do
    set l = length(β)
    find members of I s.t. α_1 = I_j and β_l = I_k and set i1 = 1, i2 = l
    while c(p1(α_i1), p1(β_i2)) and i1 ≤ length(α) do
        i1 := i1 + 1 and i2 := i2 − 1 and j := j + 1 and k := k − 1
        α_i1 = I_j and β_i2 = I_k
    end
end
However, in the algorithm for the general case (including disjunction and negation) we do not
do this pairwise mapping:
proc map(SE)
    if SE = A ∧ B then do map(A) and map(B) end
    if SE = A ∨ B then do map(A) or map(B) end
    if SE = ¬A then do not map(A) end
    if SE is a string variable β then do
        find a member of I s.t. β_1 = I_j and set i = 1
        while i ≤ length(β) do
            i := i + 1 and j := j + 1
            if β_i = I_j then true else fail end
        end
    end
end
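The recursive structure of the general procedure can be sketched compactly. In the Python illustration below a string expression is either a literal string (standing in for a fully instantiated string variable) or a nested tuple ("and", A, B), ("or", A, B) or ("not", A). This representation and the function names are assumptions made for the sketch; correlation constraints between components are deliberately left out.

```python
def occurrences(component, text):
    """Yield every 1-based start position at which component occurs in text."""
    start = text.find(component)
    while start != -1:
        yield start + 1
        start = text.find(component, start + 1)


def matches(expr, text):
    """Evaluate a string expression built from and/or/not over components."""
    if isinstance(expr, str):
        # A component matches if it occurs at least once in the input string.
        return any(True for _ in occurrences(expr, text))
    op = expr[0]
    if op == "and":
        return matches(expr[1], text) and matches(expr[2], text)
    if op == "or":
        return matches(expr[1], text) or matches(expr[2], text)
    if op == "not":
        return not matches(expr[1], text)
    raise ValueError(f"unknown operator: {op!r}")


print(list(occurrences("ag", "cagtag")))
print(matches(("and", "ca", ("not", "tt")), "cagtag"))
```

The generator occurrences plays the role that backtracking plays in the logic program: each further request for a value resumes the search for the next placement of the component.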
Since our program is compiled to native code without an emulator or top-level query evaluator
in clp(FD), we generate all the possible matches between a string variable and an input string
by a failure-driven loop. This avoids the need to write our own query evaluator with interactive
backtracking on user input of `;'. Moreover, since our searches are computationally expensive
we do not use setof/3 in order to collect solutions, and prefer to give the user the ability to
abort the computation if they think that too many solutions are being produced, or too much
time is being taken by the computation.
5 Interfacing the search engine to the WWW
5.1 General approach
We have implemented a search engine based on our language using clp(FD), and have also
produced a simple glass-teletype front-end which permits users to specify constraints on stem
loops in an interactive fashion. The program is based on the style of the definition of stemloop/6 in Section 3:
stemloopRNA :-
    stringvar(α), stringvar(β), stringvar(γ),
    end_start(α, β, 1), end_start(β, γ, 1),
    write('Length of stem [Min,Max] '), read([MinS,MaxS]),
    write('Length of loop [Min,Max] '), read([MinL,MaxL]),
    write('Mismatch [Min,Max] '), read([MinM,MaxM]),
    StemLength #>= MinS, StemLength #<= MaxS,
    LoopLength #>= MinL, LoopLength #<= MaxL,
    Mismatch #>= MinM, Mismatch #<= MaxM,
    maxlength(α, StemLength),
    length(β, LoopLength),
    maxlength(γ, StemLength),
    rev_compl_RNA(α, γ, Mismatch),
    write('Input string: '), read(InString),
    make_stringvar(InString, InStringVar),
    append_all([α, β, γ], SearchString),
    match(SearchString, InStringVar),
    output([α, β, γ]).
Due to our implementation in a logic programming language, we permit the user to enter
variables for requested values (except for the input string!) and let the logic computation
attempt to generate those values. Obviously, the more the string variable data structures are
constrained by the input, the more efficient the computation is.
5.2 Why a Web interface?
A Web interface proved an attractive proposition due to several factors: the
ease with which a user-friendly interface can be provided, the freedom from multiple
architecture considerations and the fact that system updates can be made available instantaneously:
User interface The system interface as sketched above is not attractive to users since they
usually want to make repeated queries with small changes in parameters; some kind of
form-like interface would be ideal for this where user-entered data values are preserved
between queries. The Web provides a very easy way to construct such an interface.
Architectures The non-Web version has to be recompiled for the different architectures/operating
systems on which it can be run (and there will always be one user who has a machine to
which the clp(FD) compiler has not been ported). The Web version allows us to compile
the program for one architecture only: our server.
Testing At present the program is in a Beta-test stage; we need to get feedback quickly from
potential users and would like to do this without having to physically install it on their
systems.
Updates As the program is being updated rapidly, we would like to make these updates
immediately available to users; the Web is the ideal mechanism to
achieve this.
Hence in order to make our program more accessible and to make the latest version of the
search engine available we have constructed a user interface accessible via a Web browser
capable of handling HTML forms. We have given a default query data set and input string in
the form so that the naive user has an example query to experiment with.
Interfacing to the search engine is achieved by using the PiLLoW libraries, which permit the
user to enter descriptions of the structures that they are interested in and to initiate a mapping
operation, and which then return the results of the mapping to the user. The queries are handled
by a query evaluator, which checks query parameters, expands macros, and translates the
queries into an internal form. This form is passed down to the constraint search engine which
sets up the data structures, imposes the constraints on them and uses a matching algorithm to
solve the constraints. Results of matching are output as the locations of the strings found,
and optionally the strings themselves; we plan to enhance the system with some
graphical representation of the structures found.
5.3 Forms and CGI interface
The general issues of making applications accessible using the Web are covered in various texts,
see for example [6]; the PiLLoW system is described in [7] and contains a detailed description
of methods for interfacing logic programming systems to the Internet/WWW.
Briefly, an Internet client can invoke a program on a server via a browser, for example, by
sending the URL of the program to the server (as long as the program has a recognised
extension, .cgi, and the right permissions are set). Such a program is called a CGI (Common
Gateway Interface) program. Output from the invocation is returned to the client and must
be in the form of an HTML page if it is to be interpreted by a browser. The main challenge
is that of permitting data from the client to be sent to the server which is then accepted as
input by the program on the server. Sending data from the client to the server-side program
can be accomplished using HTML forms which permit the user to enter or select values for
fields which may be text or numeric in type.
There are two methods for actually sending this data: GET and POST. In the former the data
is appended to the URL of the CGI program and is then put into an environment variable
called for example QUERY_STRING. The advantage of the GET method is its relative simplicity
and the fact that the query information is visible at the client side as an addendum to the
URL of the CGI program. The disadvantage of this method is that the environment can run
out of space when a large amount of data is sent.
In the POST method, on the other hand, some information is put into environment variables,
for example the number of bytes of the actual data, and the data itself is then sent to the CGI
program as standard input. The program must pick up the length of the data contents from the
environment, since no distinguished character is sent to
indicate the end of the data stream, and then use this length to read the data
byte by byte. The POST method is best suited to situations in which a relatively large
amount of data is sent by the client, and our implementation makes use of this method
since we provide a specialised query engine and users provide both a query (in the form of
the required parameters) and the input string (see Figure 2). The form can be found at
http://www.soi.city.ac.uk/~drg/cgi-bin/struct-form.html, and sample databases at
http://www.ebi.ac.uk/srs/srsc.
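The length-driven reading used by the POST method can be sketched as follows. This is a minimal Python illustration under stated assumptions, not our clp(FD) code: CONTENT_LENGTH is the standard CGI variable carrying the byte count, while the function name and the field name seq are hypothetical.

```python
import io
from urllib.parse import parse_qs

def read_post_fields(environ, stdin):
    """Read exactly CONTENT_LENGTH bytes from standard input and decode them."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    body = stdin.read(length)            # no end-of-data marker is sent
    parsed = parse_qs(body.decode("ascii"), keep_blank_values=True)
    return {name: values[0] for name, values in parsed.items()}

# Simulated request: the environment carries the length, stdin the data.
body = b"struct=Stem+loop&spacer_min=24&seq=taattttaat+caaatgaaaa"
fields = read_post_fields(
    {"CONTENT_LENGTH": str(len(body))},
    io.BytesIO(body + b"ignored"),       # bytes beyond the length are not read
)
```

In a real CGI program the two arguments would be `os.environ` and `sys.stdin.buffer`.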
The form comes pre-filled with data, and when reset will be re-filled with this data. Numeric
fields may be left blank or filled in; the less constrained the query is, the longer the processing
will take. A disadvantage of the CGI interface is that no client-side processing is performed,
and thus failures can occur if, for example, the fields are filled in incorrectly.
Our program checks the types and values of the input data and returns a `failed query' message
to the user if any data violation occurs.
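The kind of server-side check described above might look like the sketch below. It is written in Python for illustration; the particular rules (non-negative integers, minimum not exceeding maximum) and the function names are assumptions, since the exact checks performed by our program are not listed here.

```python
EMPTY = "$empty"   # marker for a field left blank, as in the query output

def check_range(fields, low_name, high_name):
    """Return (low, high) with None for blanks, or raise ValueError."""
    bounds = []
    for name in (low_name, high_name):
        raw = fields.get(name, EMPTY)
        if raw in ("", EMPTY):
            bounds.append(None)
        elif str(raw).isdigit():
            bounds.append(int(raw))
        else:
            raise ValueError(name + " is not a non-negative integer")
    low, high = bounds
    if low is not None and high is not None and low > high:
        raise ValueError(low_name + " exceeds " + high_name)
    return low, high

def validate(fields):
    """Return the checked ranges, or the string 'failed query'."""
    try:
        return {
            "correl": check_range(fields, "correl_min", "correl_max"),
            "spacer": check_range(fields, "spacer_min", "spacer_max"),
        }
    except ValueError:
        return "failed query"
```

For instance, `validate({"correl_min": "7", "correl_max": "5"})` yields the failure message because the minimum exceeds the maximum.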
The alphabets that the program can process are:
DNA: a,c,g,t with complements a-to-t, t-to-a, c-to-g, g-to-c, and
RNA: a,c,g,u with complements a-to-u, c-to-g, g-to-c, g-to-u, u-to-a, u-to-g.
Radio buttons are used to select the alphabet (the default is DNA). The program
will accept searches for RNA structures using an alphabet of a,c,g,t but will translate t to u
and then use the RNA complementation. Output in this case uses the a,c,g,u alphabet.
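The two alphabets can be pictured as sets of complement pairs, as in this illustrative Python sketch (the names are hypothetical, and this is not our clp(FD) representation). Because RNA allows both g-to-c and g-to-u, complementation is modelled as a relation rather than a function.

```python
# Complement pairs as listed above; RNA additionally allows the g-u pairs.
COMPLEMENTS = {
    "DNA": {("a", "t"), ("t", "a"), ("c", "g"), ("g", "c")},
    "RNA": {("a", "u"), ("c", "g"), ("g", "c"),
            ("g", "u"), ("u", "a"), ("u", "g")},
}

def normalise(sequence, alphabet):
    """For RNA searches, translate t to u; DNA input is left unchanged."""
    return sequence.replace("t", "u") if alphabet == "RNA" else sequence

def is_complement(x, y, alphabet):
    """True if nucleotide x may pair with nucleotide y in the given alphabet."""
    return (x, y) in COMPLEMENTS[alphabet]
```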
Structures are of the form
|---------|------|---------|
     A       B        C
where A and C are correlated regions of equal length and B is a `spacer'. Types of target
structures that can be searched for are: Stem loop (default), Repeat, Inverse, with selection
by radio button. These structures are based on the following correlations:
identity, giving repeat structures;
reverse, giving inverse structures (palindromes);
complement, giving stem loops when combined with reverse.
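The three correlations can be read as transformations that predict region C from region A. The following Python sketch illustrates this for exact matches over the DNA alphabet; the dictionary and function names are hypothetical and the code is not taken from our implementation.

```python
DNA_COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}

def identity(region):
    return region

def reverse(region):
    return region[::-1]

def complement(region):
    return "".join(DNA_COMPLEMENT[base] for base in region)

# Each structure type maps region A to the region C expected after the spacer.
EXPECTED_C = {
    "Repeat": identity,
    "Inverse": reverse,
    "Stem loop": lambda region: complement(reverse(region)),
}

def matches(structure, a, c):
    """True if C is exactly correlated with A for the given structure type."""
    return EXPECTED_C[structure](a) == c
```

For example, with A = gtatt a stem loop requires C = aatac, whereas an inverse structure requires C = ttatg.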
Correlated regions are always of equal length (measured in nucleotides); this length may be
specified within a minimum and maximum range, and the regions may differ within a percentage
mismatch range based on Hamming distance. The length of the spacer region, measured in nucleotides, may
similarly be specified within a minimum and maximum range, as can the total length of the
structure. This potential redundancy in information gives the user the freedom to omit some
of the parameters as they see fit.
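Two small calculations underlie these parameters: the mismatch percentage is the Hamming distance divided by the region length, and the total length equals twice the correlated length plus the spacer length, which is why one of the three ranges can be omitted. The Python sketch below illustrates both; the function names are hypothetical, and how our program rounds the percentage is not specified here.

```python
def mismatch_percent(expected, actual):
    """Hamming distance between two equal-length regions, as a percentage."""
    if len(expected) != len(actual):
        raise ValueError("Hamming distance needs regions of equal length")
    mismatches = sum(1 for x, y in zip(expected, actual) if x != y)
    return 100.0 * mismatches / len(expected)

def implied_total_range(correl, spacer):
    """Total length is 2 * correl + spacer, so its bounds follow from theirs."""
    return (2 * correl[0] + spacer[0], 2 * correl[1] + spacer[1])

# For the stem cagctcg the exact partner is cgagctg; cgggctg differs in one
# of seven positions (the 14% stem-loop reported in Section 5.7).
percent = mismatch_percent("cgagctg", "cgggctg")

# Ranges from the Figure 2 query: correlated 5..7, spacer 24..26.
total = implied_total_range((5, 7), (24, 26))
```

Here `percent` is about 14.3 and `total` is `(34, 40)`, consistent with the structure lengths in the sample output of this section.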
The position in the input string where the structure is to be searched for may be specified to start
and end either at exact positions or within a given range (i.e. as a window). The input string
itself can be given as one or more lines, each optionally prefaced by an integer indicating
the start position. If the first line is prefaced by an integer, then that value is taken to be the
initial start position of the string; otherwise the default is that the string starts at position 1.
Spaces are ignored in the input string.
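The input conventions just described can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, and the sketch simply skips the integer prefixes on later lines without checking that they are consistent with the first.

```python
def parse_input(lines):
    """Return (start_position, sequence) from the lines of the input string."""
    start, parts = 1, []
    for index, line in enumerate(lines):
        tokens = line.split()
        if tokens and tokens[0].isdigit():
            if index == 0:
                start = int(tokens[0])   # only the first prefix sets the start
            tokens = tokens[1:]
        parts.extend(tokens)             # spaces between blocks are dropped
    return start, "".join(parts)

start, sequence = parse_input([
    "61 agaaatatga cagtcaaaat",
    "81 cttacagatc aaaacctgat",
])
```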
Output comprises a repetition of the query parameters and the input string, followed by those
structures found (if any) in the form:
Mismatch percent
start=position, end=position
start=position, end=position,   correlated region (1)
start=position, end=position,   spacer region
start=position, end=position,   correlated region (2)

[Figure 2: Input form]
For example, the result of the query described in Figure 2 is given below:
Structure search version $Revision: 1.55 $
alphabet=DNA
struct=Stem loop
correl_min=5 correl_max=7
mismatch_min=0 mismatch_max=0
spacer_min=24 spacer_max=26
struct_min=$empty struct_max=$empty
pos_start=100 pos_end=400
Input string:
1 taattttaat caaatgaaaa aaaacaaagc ggtaatgaaa attgccgctt tttctttttg
61 agaaatatga cagtcaaaat cttacagatc aaaacctgat aacagtattt tctcagtcta
121 atttttgcgt attaatacaa tacgggattg cgtagataaa gtattatcaa aaaactaata
181 attttatgaa attaaataat tttttctatt gactattaaa gaatccggag taaattagtc
241 tccaaaatta accaaaacta ggtaatttat ccggtcaaag gttatcttaa gtattaaccc
301 taagaaaaag gaaaacgagt atgtccagta caggatatgc tccattttat ctccgtttta
361 ttcagttccc aagtaatgaa gttttactct atgaatactg gaaacttgtt cagaattttg
421 tacaaaaggt tagtaaaata acggtaagat tagcacaaat cgttggcatt ctcggcgaaa
481 aaactatttg gaaataccaa agtactttta atgatggcat gctggatatt gtggtttggt
541 tatcttattc aaaataaatt attaacaagg agatttaata tg
Structures found:
Stemloop
Mismatch 0%
start=105, end=138
start=105, end=109,   gtatt
start=110, end=133,   ttctcagtctaatttttgcgtatt
start=134, end=138,   aatac
Mismatch 0%
start=169, end=204
start=169, end=173,   aaaaa
start=174, end=199,   actaataattttatgaaattaaataa
start=200, end=204,   ttttt
Mismatch 0%
start=170, end=205
start=170, end=174,   aaaaa
start=175, end=200,   ctaataattttatgaaattaaataat
start=201, end=205,   ttttt
Mismatch 0%
start=249, end=284
start=249, end=253,   taacc
start=254, end=279,   aaaactaggtaatttatccggtcaaa
start=280, end=284,   ggtta
Mismatch 0%
start=169, end=205
start=169, end=174,   aaaaaa
start=175, end=199,   ctaataattttatgaaattaaataa
start=200, end=205,   tttttt
Mismatch 0%
start=231, end=269
start=231, end=237,   taaatta
start=238, end=262,   gtctccaaaattaaccaaaactagg
start=263, end=269,   taattta
No (more) found
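To make the listing concrete, the Python sketch below enumerates exact (0% mismatch) DNA stem loops by brute force over a 40-nucleotide window taken from positions 101-140 of the input string. It is an illustrative stand-in, not our constraint-based engine, and an exhaustive enumeration of this kind need not report the same set or order of instances as the listing above; the function names are hypothetical.

```python
PAIRS = {"a": "t", "t": "a", "c": "g", "g": "c"}   # DNA complements

def reverse_complement(region):
    return "".join(PAIRS[base] for base in reversed(region))

def find_stem_loops(seq, first_pos, correl_range, spacer_range):
    """Enumerate exact stem loops; positions are reported relative to first_pos."""
    hits = []
    for correl in range(correl_range[0], correl_range[1] + 1):
        for spacer in range(spacer_range[0], spacer_range[1] + 1):
            total = 2 * correl + spacer
            for i in range(len(seq) - total + 1):
                a = seq[i:i + correl]
                c = seq[i + correl + spacer:i + total]
                if c == reverse_complement(a):
                    start = first_pos + i
                    hits.append((start, start + total - 1, a, c))
    return hits

# Positions 101-140 of the input string, written in blocks of ten as above.
window = "aacagtattt tctcagtcta atttttgcgt attaatacaa".replace(" ", "")
hits = find_stem_loops(window, 101, (5, 7), (24, 26))
```

The first instance in the listing, gtatt ... aatac at positions 105-138, appears among `hits`.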
5.4 Using the PiLLoW library
We use the PiLLoW library [7] to access the data sent by the client via the form; specifically
we use the get_form_input(InputList) routine, which translates input from the form into a dictionary, Dic,
of attribute=value pairs. It translates empty values (which indicate only the presence of an
attribute) to `$empty', values with more than one line (from text areas or files) to a list of
lines as strings, and the rest to atoms or numbers. The get_form_value(Dic,Var,Val) routine is then
used to get the value of attribute Var into Val, and we also employ the text_lines(Val,Lines) routine,
which transforms a value given by a dictionary to a list of lines, for data coming from a text
area.
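The value translation described above can be mimicked in a few lines of Python. This is only an analogue for illustration and does not reproduce the PiLLoW API: the function names and the field name seq are hypothetical, and numbers are restricted to integers for simplicity.

```python
def translate_value(raw):
    """Mimic the described translation of one raw form value."""
    if raw == "":
        return "$empty"                  # attribute present, no value
    lines = raw.splitlines()
    if len(lines) > 1:
        return lines                     # text area or file: list of lines
    try:
        return int(raw)                  # numeric field
    except ValueError:
        return raw                       # anything else stays a string

def form_to_dictionary(raw_fields):
    return {name: translate_value(value) for name, value in raw_fields.items()}

dic = form_to_dictionary({
    "struct_min": "",
    "correl_min": "5",
    "alphabet": "DNA",
    "seq": "1 taattttaat\r\n11 caaatgaaaa",
})
```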
5.5 Porting the PiLLoW library to clp(FD)
The PiLLoW library has been well designed to make the task of interfacing logic programs to
the Web easy. Our application required only a (significant) subset of this library, namely the routines
associated with accessing the data sent to a CGI program by a client program; however, the
library had not been ported to clp(FD), and this we had to do.
The first problem in making the port was that the PiLLoW libraries make extensive use of
definitions written as DCGs; clp(FD) does not have a DCG expander built into its clause
reader. We got around this by reading in the libraries using SICStus Prolog, and then listing the
consulted programs to file; an unwelcome side-effect was that the comments and meaningful
variable names were lost, and the code grew in size.
Secondly, the code in the original PiLLoW libraries does not completely conform to the ISO
Prolog standard [2, 8], whereas clp(FD) is compliant. Specifically, we changed all occurrences
of atom_chars/2 and number_chars/2 in the original code to atom_codes/2 and number_codes/2
respectively.
The environment is accessed in a cleaner way in clp(FD) than in the original libraries, and
thus we were able to use the following routine:
    % Look the variable up directly and convert the atom to a code list.
    getenvstr(Name,Content) :-
        unix(getenv(Name,String)),
        atom_codes(String,Content).
as opposed to the original
    % Build the command "echo $Var", run it, and read one non-empty line.
    getenvstr(Var,Val) :-
        name(Var,VS),
        append("echo $",VS,SCommand),
        name(Command,SCommand),
        unix(popen(Command,read,S)),
        get_line(S,Val),
        Val = [_|_].
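The difference between the two routines is that the first asks the process environment directly, whereas the second spawns a shell to echo the variable and reads back a line. A Python analogue of the two styles is sketched below; it is illustrative only, the function names are hypothetical, failure is modelled by returning None, and the shell variant assumes a POSIX shell.

```python
import os
import subprocess

def getenvstr_direct(name):
    """Look the variable up in the process environment; return character codes."""
    value = os.environ.get(name)
    if value is None:
        return None                      # unset variable: treated as failure
    return [ord(ch) for ch in value]

def getenvstr_via_shell(name):
    """Original style: run `echo $NAME` in a shell and read one line."""
    result = subprocess.run("echo $" + name, shell=True,
                            capture_output=True, text=True)
    line = result.stdout.splitlines()[0] if result.stdout else ""
    return [ord(ch) for ch in line] or None   # empty line counts as failure
```

The direct version avoids starting a second process for every lookup and does not depend on how the shell expands the variable.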
Finally, the PiLLoW system and clp(FD) are both module-based but, as one might expect,
employ a different syntax, forcing us to modify the code accordingly.
In summary, however, the PiLLoW libraries required few modifications to make the port to
clp(FD), although our task would have been easier if we had had a manual or more complete
documentation for the system.
5.6 Good software engineering practice
Two variants of our system exist: one with a Web interface and the other with a simple teletype
interface; both utilise the same search engine. We have made use of the module facility of
clp(FD) to ensure that both variants use the same version of the search engine:
we keep the search engine code separate from the teletype-interface and web-interface
routines, allowing the engine to be linked to either interface. Moreover, we use RCS, the
Revision Control System, to control the generation of versions and to be
able to back out of a version if it proves to be flawed. We have taken advantage of the ability of
RCS to number versions automatically and to insert this information in the source code, thus
enabling users to identify to us the version of the program which they are using for feedback
purposes.
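The version banner printed by the program carries the expanded RCS keyword (for example `$Revision: 1.55 $`), so the revision number can be recovered mechanically from a user's report. A small Python sketch, with a hypothetical function name:

```python
import re

def rcs_revision(banner):
    """Extract the revision number from an expanded RCS $Revision$ keyword."""
    match = re.search(r"\$Revision:\s*([0-9.]+)\s*\$", banner)
    return match.group(1) if match else None

revision = rcs_revision("Structure search version $Revision: 1.55 $")
```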
5.7 Testing
Our search engine source program is 388 lines (10K) of clp(FD) code; we have compiled our
program to 370K of stand-alone sun-sparc code using the clp(FD) system [9], and have used
this to test the detection of stem-loops from a variety of databases, including the entry with
ID CXSTPLUC2 (accession number X87994) from the EMBL nucleotide sequence database
release 49 (Nov 1996),
URL: http://www2.no.embnet.org/srs/srsc?[EMBL-id:CXSTPLUC2]+-sf+GCG. For example,
our program took 40 ms on a Sun IPX to find the stem-loop cccgtcca, gctcggct, tggacggg at
positions 20-43 (perfect matching), and 90 ms to find the stem-loop cagctcg, gcttgga, cgggctg
at positions 26-46 (mismatch of 14%) in a string of nucleotides from positions 1-60. More
complete test results can be found in [10].
Readers can access the Web version of our program at http://www.soi.city.ac.uk/~drg/cgi-bin/struct-form.html
6 Summary
We have described an implementation in the finite domain constraint logic programming language clp(FD) of a web-based biosequence structure search engine. The engine is
based on a declarative language with constraints over distances (between strings in terms of
nucleotides) and relations over a finite alphabet of nucleotides. Users specify the parameters of
the structure which they wish to search for, and also provide an input string over which the
search is to be carried out. The search engine constructs a schema, or generalised structure,
which is then matched against the input string, and instances are returned. These structures range
from strings and regular expressions to more complex structures such as palindromes, repeats,
stem loops and pseudo-knots. Limitations of the present implementation include the relatively
small datasets which can be efficiently handled using the POST method; we plan that in the
future users will be able to supply the URL of a biosequence database and that our implementation will
then retrieve the input string itself using PiLLoW routines.
The search engine uses a naive backtracking algorithm for matching, but despite its inefficiencies we have tested our implementation on some real biological sequences with encouraging
results. We are now in the process of making an object-oriented design for a CSP-based solver
and implementing it in C++; we intend to interface this solver to the high-level implementation
which has been made using constraint logic programming.
A challenging task for the future will be to extend our program to search for structures in
protein databases and to interface it to a plug-in visualiser such as Chime [1].
Acknowledgements
We wish to thank Daniel Diaz, author of the clp(FD) package, for his help with designing some
of the routines needed by our solver, and Manuel Hermenegildo and the other authors of the
excellent PiLLoW package. This work has been carried out as part of a project financed by the
British Council and the Norwegian Research Council, which provided funding for the research
visits. In addition, Inge Jonassen's research post is financed by the Norwegian Research
Council.
References
[1] Chemscape chime 1.0. http://www.mdli.com/chemscape/chime/. Netscape Navigator
plug-in.
[2] ISO/IEC 13211-1, Information Technology - Programming Languages - Prolog - Part
1: General Core, 1995.
[3] A. Bairoch, P. Bucher, and K. Hofman. The PROSITE database, its status in 1995.
Nucleic Acids Research, 24(1):189-196, 1996.
[4] L. Baranyi, W. Campell, K. Ohshima, S. Fujimoto, M. Boros, and H. Okada. The antisense
homology box: A new motif within proteins that encodes biologically active peptides.
Nature Medicine, 1(9):894-901, 1995.
[5] A. Brazma, I. Jonassen, I. Eidhammer, and D. Gilbert. Approaches to the automatic
discovery of patterns in biosequences. Technical Report TCU/CS/1995/18, Department
of Computer Science, City University, 1995. Also Technical Report 113, Department of
Informatics, University of Bergen, Bergen, Norway.
[6] Steven E Brenner and Edwin Aoki. Introduction to CGI/PERL. M&T Books, 1996.
[7] D. Cabeza, M. Hermenegildo, and S. Varma. The PiLLoW/CIAO Library for INTERNET/WWW Programming using Computational Logic Systems. In Proceedings of the 1st Workshop on Logic Programming Tools for INTERNET Applications,
pages 72-90, JICSLP'96, Bonn, September 1996. Text and code available from
http://www.clip.dia.fi.upm.es/miscdocs/pillow/pillow.html.
[8] P Deransart, A Ed-Dbali, and L Cervoni. Prolog: The Standard. Springer, 1996.
[9] D. Diaz and P. Codognet. A Minimal Extension of the WAM for clp(FD). In David S.
Warren, editor, Proceedings of the Tenth International Conference on Logic Programming,
pages 774-790, Budapest, Hungary, 1993. The MIT Press.
[10] I. Eidhammer, D. Gilbert, I. Jonassen, and M. Ratnayake. A constraint based structure
description language for biosequences. Technical report 1997/04, Department of Computer Science, City University, UK and Department of Informatics, University of Bergen,
Norway, 1997.
[11] Ingvar Eidhammer, David Gilbert, Inge Jonassen, and Madu Ratnayake. A constraint
based structure description language for biosequences. Submitted to CP97, 1997.
[12] C. Gervet. Conjunto: constraint logic programming with finite set domains. In Maurice
Bruynooghe, editor, Logic Programming - Proceedings of the 1994 International Symposium, pages 339-358, Massachusetts Institute of Technology, 1994. The MIT Press.
[13] R. Hamming. Coding and Information Theory. Prentice Hall, Englewood Cliffs, NJ, 1982.
[14] P. V. Hentenryck and Y. Deville. Operational semantics of constraint logic programming
over finite domains. In J. Maluszynski and M. Wirsing, editors, PLILP91, number 528 in
LNCS, pages 395-406. Springer-Verlag, August 1991.
[15] V.I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals.
Doklady Akademii nauk SSSR (in Russian), 163(4):845-848, 1965. Also in Cybernetics
and Control Theory, vol 10, no. 8, pp 707-710, 1966.
[16] A. Rajasekar. Applications in constraint logic programming with strings. In Alan Borning,
editor, PPCP'94: Second Workshop on Principles and Practice of Constraint Programming, Seattle WA, May 1994.
[17] Y. Sakakibara, M. Brown, R. Hughey, I.S. Mian, K. Sjoelander, R. Underwood, and
D. Haussler. Stochastic context-free grammars for tRNA modelling. Nucleic Acids Res,
22:5112-5120, 1994.
[18] D. Searls. The Computational Linguistics of Biological Sequences. Tutorial at Third
International Conference on Intelligent Systems for Molecular Biology, 1995.
[19] C. Walinsky. CLP(Σ*): Constraint logic programming with regular sets. In Giorgio Levi
and Maurizio Martelli, editors, ICLP'89: Proceedings 6th International Conference on
Logic Programming, pages 181-196, Lisbon, Portugal, June 1989. MIT Press.
