Exercise: Cytoscape / visualization of networks

BNF079
2005
Supervisor: Carl Troein
(046-222 3496, [email protected])
How can we represent a network on a computer?
A drawing is fine for humans, but an image is awkward for a computer to compute with.
Here I am going to discuss a few of the formats found on the internet and used
in this course.
Terminology
Depending on the context and who is talking, several different words are used to describe the
same thing. Viewed from a computer science or mathematics perspective, what we would call
a network is referred to as a graph. The links between the nodes of the network (which could
represent proteins, genes or something else) could be called connections, links or edges.
What kind of information is needed?
From the simplest perspective, a network is simply a collection of links. So the minimum
information is a list of who links to whom. However, one often wants to know more. For
instance, how was the connection measured and how reliable do the experimentalists think the
link is? What kind of link is it? A protein–protein interaction or a regulatory link? In using the
networks one might also be interested in the functions of the genes, their expression patterns,
etc.
Tab delimited files
A tab is a special character, often produced by the tabulator key on the keyboard. When
printed, tabs typically look like variable amounts of space, causing whatever follows them to
be printed in neat columns. In Perl the symbol used to represent a tab is \t. Because it is
distinct from a regular space (' ') and looks nice when viewed, the \t is often used to separate
the columns in a file from each other (although sometimes one uses space or comma instead).
When representing a network in a tab-delimited file, the minimum format is that each line
(ended by a newline character, \n) corresponds to one link. The first column then typically
contains the name of the node the link starts from and the second column the name of the
node the link points to.
So this file:
A\tC\n
B\tC\n
corresponds to this network:
[Figure: nodes A and B, each with a link pointing to node C]
Advantages of tab delimited files:
• Simple and easy to read.
• Easy to create.
Drawbacks:
• Not suitable for more complex relations.
• Many possible formats that could even look the same.
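As an illustration of the minimum two-column format, here is a small Perl sketch that splits each line at the tab character. The in-memory @lines array stands in for a real file and the variable names are made up for this example; they are not part of the course code.

```perl
use strict;
use warnings;

# Two lines of a tab-delimited network file (illustrative data,
# held in memory instead of being read from a file).
my @lines = ("A\tC\n", "B\tC\n");

my @pairs;    # each element is [from-node, to-node]
for my $line (@lines) {
    chomp $line;                  # remove the trailing newline
    next if $line =~ /^\s*$/;     # skip empty lines
    my ($from, $to) = split /\t/, $line;
    push @pairs, [$from, $to];
    print "$from -> $to\n";
}
```

Splitting on /\t/ rather than on any whitespace means node names are allowed to contain spaces, which is one reason the tab is a popular column separator.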
Cytoscape SIF files
In this course we use the program Cytoscape (http://www.cytoscape.org/) to visualize
networks. The names of the input files to Cytoscape usually end with .sif. These SIF (simple
interaction format) files are either tab- or space-delimited, and have three columns on every
line. The first column is the name of the node the link originates from. The second column
is the type of the link, e.g., protein–protein interaction or gene regulation, and in our examples
it is enclosed in (). The third column is the name of the node the link points to.
So the file
A (black) C\n
B (red) C\n
corresponds to this network:
[Figure: nodes A and B, each with a link pointing to node C]
One of your first tasks will be to create such a file (or something similar if you
fancy) and view the output in Cytoscape.
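One possible way to produce such a file from Perl is sketched below. The helper name sif_line and the link data are assumptions made for this example. The sketch prints to standard output, so redirecting the output to a file whose name ends in .sif gives something Cytoscape should be able to load.

```perl
use strict;
use warnings;

# Each link is [from-node, link type, to-node]; the data are illustrative.
my @links = (
    ['A', 'black', 'C'],
    ['B', 'red',   'C'],
);

# Build one space-delimited SIF line with the link type enclosed in (),
# following the convention used in the examples of this text.
# (sif_line is a hypothetical helper, not one of the course subroutines.)
sub sif_line {
    my ($from, $type, $to) = @_;
    return "$from ($type) $to\n";
}

print sif_line(@$_) for @links;
```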
If you save a network that you have been working with in Cytoscape, it is stored in a different
format called GML. In that format more information is saved about the nodes and edges, such as how
they're placed and colored.
Note that in some downloadable SIF files the same line appears multiple times. If you want to
transform a tab-delimited file with two columns (from-node and to-node) into a SIF file,
the following one-liner in Perl might be useful:
perl -ne '@col = split(" ", $_); print "$col[0] (link) $col[1]\n";' \
    <tab_delim_filename >sif_filename
The same thing can be done even more simply with gawk:
gawk '{print $1" (link) "$2}' <tab_delim_filename >sif_filename
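If repeated lines are unwanted, the conversion can be combined with a check for duplicates. The following is only a sketch: the subroutine name to_sif is made up, and whether duplicates should be dropped at all depends on what you plan to do with the network.

```perl
use strict;
use warnings;

# Convert two-column lines to SIF lines, keeping only the first copy of
# each resulting line. (to_sif is a hypothetical name for illustration.)
sub to_sif {
    my @input = @_;
    my (%seen, @sif);
    for my $line (@input) {
        my ($from, $to) = split ' ', $line;   # split on any whitespace
        next unless defined $to;              # skip blank or incomplete lines
        my $out = "$from (link) $to\n";
        push @sif, $out unless $seen{$out}++; # remember lines already emitted
    }
    return @sif;
}

# The third input line repeats the first and is dropped.
print to_sif("A\tC\n", "B\tC\n", "A\tC\n");
```

The %seen hash is the usual Perl idiom for this: the post-increment returns 0 (false) the first time a line is encountered and a positive count afterwards.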
XML formats
XML (Extensible Markup Language) is a whole family of file formats, aptly described as
"tab-delimited text files on steroids". XML is related to HTML, in that both are derived from
the more general and horribly complicated SGML. There exists an XML version of HTML
called XHTML (which has the advantage of enforcing well-structuredness), and generally
XML is a great choice to build on for complex data such as annotated networks.
We are not using any XML format for representing networks in this course, as we think the
learning curve is too steep, the potential benefits are too small, and the baseline effort for
using XML is large.
Advantages of XML:
• Very flexible for introducing further information.
• Used in many different contexts (might be worthwhile to learn sometime).
• Many tools for working with XML exist.
Disadvantages:
• A bit more difficult to parse, even though there are subroutines/packages available in Perl.
• Somewhat less human-readable.
Data formats we use in Perl
We have written some subroutines you are welcome to use. They use two different
representations:
List of links
Inspired by the Cytoscape sif files and tab-delimited files, this is an array where each element
corresponds to a link. As a link is not just a single number, each array element is technically a
reference to a hash. The hash has two keys:
'nodeF'
'nodeT'
The value of 'nodeF' is the node where the link starts (F stands for From). The value of
'nodeT' is the node that the link points to (T stands for To).
The network shown above is represented by
$link[0]{'nodeF'}='A';
$link[0]{'nodeT'}='C';
$link[1]{'nodeF'}='B';
$link[1]{'nodeT'}='C';
The major disadvantage of this data format is that it is cumbersome to find all the links going
to or from a specific node, as well as the total number of nodes and their names.
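To make this drawback concrete, here is a sketch of how such queries could look with the list of links. The subroutines links_to and node_names are hypothetical helpers written for this example, not the course subroutines. Both have to walk through the entire list.

```perl
use strict;
use warnings;

# The example network in the list-of-links format.
my @link;
$link[0]{'nodeF'} = 'A';  $link[0]{'nodeT'} = 'C';
$link[1]{'nodeF'} = 'B';  $link[1]{'nodeT'} = 'C';

# Return all links that point to the given node (scans the whole list).
sub links_to {
    my ($node, @links) = @_;
    return grep { $_->{'nodeT'} eq $node } @links;
}

# Return the sorted names of all nodes that appear in any link.
sub node_names {
    my %names;
    for my $l (@_) {
        $names{ $l->{'nodeF'} } = 1;
        $names{ $l->{'nodeT'} } = 1;
    }
    return sort keys %names;
}

my @into_c = links_to('C', @link);
my @nodes  = node_names(@link);
print scalar(@into_c), " links point to C\n";
print "Nodes: @nodes\n";
```

For a network with many links and many queries, this repeated scanning is what makes the format cumbersome.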
Adjacency matrix format
In the more mathematical literature a network is often represented by a matrix called the
adjacency matrix. For those of you who know matrices, its matrix element Aij is 1 if a link
connects node j to node i, and 0 otherwise. The format we use corresponds to the columns of
this matrix plus a little bit more: what we have is a hash (%adja) whose keys are the nodes of
the network.
The value of a given node ('F') is a hash (reference), $adja{'F'}, whose keys are the nodes that node
'F' is connected to. (So the link starts at node 'F', if the link is directed.) $adja{'F'} is undefined
if node 'F' has no outgoing links. If node 'F' is connected to the node 'T' then the value of
$adja{'F'}{'T'} is a hash (reference) containing link properties. You are not necessarily going to use
this hash. When reading a sif file with the subroutine read_sif_file, the keys in this hash are the
types of links found in the sif file connecting 'F' to 'T' and the values are the number of times
this line appears in the file.
[Figure: example network with nodes A, B, C and D]
Example
After reading the sif file
A (black) C\n
B (red) C\n
B (black) D\n
The adja hash is as follows:
$adja{A}{C}{'black'} = 1
$adja{B}{C}{'red'} = 1
$adja{B}{D}{'black'} = 1
Note that with this format it is still difficult to find all the links pointing towards a particular
node.
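The structure described above can be sketched as follows. This is not the course's read_sif_file subroutine, only an illustrative re-creation of the nested hash from three SIF lines held in memory, under the assumption that the link type is stored without its enclosing (). It also shows why incoming links are harder to find than outgoing ones.

```perl
use strict;
use warnings;

# The example SIF file, held in memory for illustration.
my @sif = ("A (black) C\n", "B (red) C\n", "B (black) D\n");

# Build the nested hash: $adja{from}{to}{type} = number of occurrences.
my %adja;
for my $line (@sif) {
    my ($from, $type, $to) = split ' ', $line;
    next unless defined $to;          # skip blank or incomplete lines
    $type =~ s/^\((.*)\)$/$1/;        # strip the enclosing ()
    $adja{$from}{$to}{$type}++;
}

# Outgoing links are a direct lookup on the first key.
my @from_b = sort keys %{ $adja{'B'} };

# Incoming links need a scan over every node that has outgoing links.
my @into_c = sort grep { exists $adja{$_}{'C'} } keys %adja;

print "B links to: @from_b\n";
print "C is linked from: @into_c\n";
```

If incoming links are needed often, one option is to build a second hash with the roles of the two nodes swapped while reading the file.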
Network data to work with
Gene regulatory networks for Drosophila melanogaster and Saccharomyces cerevisiae can be
found at http://www.thep.lu.se/~carl/bnf079/networks/ .
There are several larger networks at
ftp://ftp.blueprint.org/pub/BIND/data/cytoscape/ .
Network downloading and Cytoscape exercise
Create your own little network sif file such that you can visualize it in Cytoscape.
Observe that you can drag the nodes around manually with the mouse.
Now it is time to put arrows on the links. In Cytoscape this is done by changing the visual
style. Create your own visual style and edit it such that the links get arrows.
Download at least one network from the internet and visualize it in Cytoscape.
The largest component of the network should contain at least 15 links.
Write where you got it from and what it is (plus where you found this information).
Visualize it in at least two different layouts.
Notice you can select a subnetwork with the mouse and one of the pull-down menus.
Download a transcriptional regulatory network for E. coli from
http://www.weizmann.ac.il/mcb/UriAlon/ . Parse it into a sif file. Visualize it in Cytoscape.