BNF079 2005
Exercise: Cytoscape / visualization of networks
Supervisor: Carl Troein (046222 3496, [email protected])

How can we represent a network on a computer? For humans a drawing is fine, but an image is not a manageable form for a computer to do computations on. Here I am going to discuss a few of the formats found on the internet and used in this course.

Terminology

Depending on the context and who is talking, several different words are used to describe the same thing. Viewed from a computer science or mathematics perspective, what we would call a network is referred to as a graph. The links between the nodes of the network (which could represent proteins, genes or something else) could be called connections, links or edges.

What kind of information is needed?

From the simplest perspective, a network is simply a collection of links, so the minimum information is a list of who links to whom. However, one often wants to know more. For instance, how was the connection measured, and how reliable do the experimentalists think the link is? What kind of link is it: a protein–protein interaction or a regulatory link? When using the networks one might also be interested in the functions of the genes, their expression patterns, etc.

Tab delimited files

A tab is a special character, often produced by the tabulator key on the keyboard. When printed, tabs typically look like variable amounts of space, causing whatever follows them to line up in neat columns. In Perl the symbol used to represent a tab is \t. Because it is distinct from a regular space (' ') and looks nice when viewed, the \t is often used to separate the columns of a file from each other (although sometimes a space or a comma is used instead). When representing a network in a tab-delimited file, the minimum format is that each line (ended by a newline character, \n) corresponds to one link.
The first column then typically contains the name of the node the link starts from, and the second column the name of the node the link points to. So this file:

    A\tC\n
    B\tC\n

corresponds to this network:

[Figure: nodes A and B, each linked to node C]

Advantages of tab delimited files:
• Simple and easy to read.
• Easy to create.

Drawbacks:
• Not suitable for more complex relations.
• Many possible formats, which may even look the same.

Cytoscape SIF files

In this course we use the program Cytoscape (http://www.cytoscape.org/) to visualize networks. The names of the input files to Cytoscape usually end with .sif. These SIF (simple interaction format) files are either tab- or space-delimited, and have three columns on every line. The first column is the name of the node the link originates from. The second column is the type of the link, e.g., protein–protein interaction or gene regulation, and in our examples it is enclosed in (). The third column is the name of the node the link points to. So the file

    A (black) C\n
    B (red) C\n

corresponds to this network:

[Figure: nodes A and B, each linked to node C]

One of your first tasks will be to create such a file (or something similar if you fancy) and view the output in Cytoscape.

If you save a network that you have been working with in Cytoscape, it is stored in a different format called GML. That format holds more information about the nodes and edges, such as how they are placed and colored.

Note that in some downloadable SIF files the same line appears multiple times.

If you want to transform a tab delimited file with two columns (fromnode and tonode) into a SIF file, the following one-liner in Perl might be useful:

    perl -ne '@col = split(/\s+/,$_); print "$col[0] (link) $col[1]\n";' <tab_delim_filename >sif_filename

The same thing can be done even more simply with gawk:

    gawk '{print $1" (link) "$2}' <tab_delim_filename >sif_filename

XML formats

XML (Extensible Markup Language) is a whole family of file formats, aptly described as "tab-delimited text files on steroids".
XML is related to HTML, in that both are derived from the more general and horribly complicated SGML. There exists an XML version of HTML called XHTML (which has the advantage of enforcing well-structuredness), and generally XML is a great choice to build on for such complex data. We are not using any XML format for representing networks in this course, as we think the learning curve is too steep, the potential benefits are too small, and the baseline effort for using XML is large.

Advantages of XML:
• Very flexible for introducing further information.
• Used in many different contexts (might be worthwhile to learn sometime).
• Many tools for working with XML exist.

Disadvantages:
• A bit more difficult to parse, even though there are subroutines/packages available in Perl.
• Somewhat less human-readable.

Data formats we use in Perl

We have written some subroutines you are welcome to use. They use two different representations:

List of links

Inspired by the Cytoscape SIF files and tab delimited files, this is an array where each element corresponds to a link. As a link is not just a single number, the array element is technically a reference to a hash. The hash has two keys: 'nodeF' and 'nodeT'. The value of 'nodeF' is the node where the link starts (F stands for From). The value of 'nodeT' is the node that the link points to (T stands for To). The network shown above is represented by

    $link[0]{'nodeF'}='A';
    $link[0]{'nodeT'}='C';
    $link[1]{'nodeF'}='B';
    $link[1]{'nodeT'}='C';

The major disadvantage of this data format is that it is cumbersome to find all the links going to or from a specific node, as well as the total number of nodes and their names.

Adjacency matrix format

In the more mathematical literature a network is often represented by a matrix called the adjacency matrix. For those of you who know matrices: its matrix element A_ij is 1 if a link connects node j to node i, and 0 otherwise.
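To make the definition concrete, here is a minimal sketch that builds the adjacency matrix from the list of links shown earlier. It is not one of the course subroutines; the variable names (@nodes, %index, @A) and the choice to order the nodes alphabetically are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The example network as a list of links (A -> C, B -> C).
my @link = (
    { nodeF => 'A', nodeT => 'C' },
    { nodeF => 'B', nodeT => 'C' },
);

# Collect the node names and give each one a matrix index.
# Sorting by name is an arbitrary choice made for this sketch.
my %seen;
$seen{ $_->{nodeF} } = $seen{ $_->{nodeT} } = 1 for @link;
my @nodes = sort keys %seen;
my %index;
@index{@nodes} = 0 .. $#nodes;

# Start from an all-zero matrix, then set A[i][j] = 1
# for every link going from node j to node i.
my @A = map { [ (0) x @nodes ] } @nodes;
for my $l (@link) {
    my $j = $index{ $l->{nodeF} };   # column: node the link starts from
    my $i = $index{ $l->{nodeT} };   # row: node the link points to
    $A[$i][$j] = 1;
}

# Print the matrix with row and column labels, tab-separated.
print join("\t", '', @nodes), "\n";
for my $i (0 .. $#nodes) {
    print join("\t", $nodes[$i], @{ $A[$i] }), "\n";
}
```

For this network only the row for C contains ones (in the columns for A and B), because both links point to C.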
The format we use corresponds to the columns of this matrix plus a little bit more: what we have is a hash (%adja) whose keys are the nodes of the network. The value for a given node ('F') is a hash reference, $adja{'F'}, whose keys are the nodes that node 'F' is connected to. (So the link starts at node 'F', if the link is directed.) $adja{'F'} is undefined if node 'F' has no outgoing links. If node 'F' is connected to the node 'T', then the value of $adja{'F'}{'T'} is a hash reference containing link properties. You are not necessarily going to use this hash. When a SIF file is read with the subroutine read_sif_file, the keys in this hash are the types of the links found in the SIF file connecting 'F' to 'T', and the values are the number of times each such line appears in the file.

Example

After reading the SIF file

    A (black) C\n
    B (red) C\n
    B (black) D\n

[Figure: network with links from A to C, from B to C and from B to D]

the adja hash is as follows:

    $adja{'A'}{'C'}{'black'} = 1;
    $adja{'B'}{'C'}{'red'} = 1;
    $adja{'B'}{'D'}{'black'} = 1;

Note that with this format it is still difficult to find all the links pointing towards a particular node.

Network data to work with

Gene regulatory networks for Drosophila melanogaster and Saccharomyces cerevisiae can be found at http://www.thep.lu.se/~carl/bnf079/networks/ . There are several larger networks at ftp://ftp.blueprint.org/pub/BIND/data/cytoscape/ .

Network downloading and Cytoscape exercise

Create your own little network SIF file so that you can visualize it in Cytoscape. Observe that you can drag the nodes around manually with the mouse.

Now it is time to put arrows on the links. In Cytoscape this is done by changing the visual style. Create your own visual style and edit it such that the links get arrows.

Download at least one network from the internet and visualize it in Cytoscape. The largest component of the network should contain at least 15 links. Write down where you got it from and what it is (plus where you found this information). Visualize it in at least two different layouts.
Notice that you can select a subnetwork with the mouse and one of the pull-down menus.

Download a transcriptional regulatory network for E. coli from http://www.weizmann.ac.il/mcb/UriAlon/ . Parse it into a SIF file. Visualize it in Cytoscape.
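
For the parsing tasks above, it may help to see how SIF lines map onto the nested hash (adja) described earlier. The following is only a sketch: it is not the course's read_sif_file, the subroutine names parse_sif_lines and reverse_adja are made up for this example, and it assumes that node names contain no whitespace. The second subroutine builds a reverse index, which is one possible way around the difficulty of finding all links pointing towards a node.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not the course's read_sif_file): build the nested
# hash $adja->{from}{to}{type} = count from a list of SIF lines.
sub parse_sif_lines {
    my @lines = @_;
    my %adja;
    for my $line (@lines) {
        # A SIF line has three whitespace-separated columns.
        my ($from, $type, $to) = split ' ', $line;
        next unless defined $to;      # skip blank or incomplete lines
        $type =~ s/^\((.*)\)$/$1/;    # drop the enclosing () if present
        $adja{$from}{$to}{$type}++;   # count repeated lines
    }
    return \%adja;
}

# Reverse index: $rev->{to}{from} = 1, to look up incoming links directly.
sub reverse_adja {
    my ($adja) = @_;
    my %rev;
    for my $from (keys %$adja) {
        $rev{$_}{$from} = 1 for keys %{ $adja->{$from} };
    }
    return \%rev;
}

# The example SIF file from the adjacency matrix section.
my $adja = parse_sif_lines("A (black) C\n", "B (red) C\n", "B (black) D\n");
my $rev  = reverse_adja($adja);

print "Links into C come from: ", join(', ', sort keys %{ $rev->{C} }), "\n";
```

Real input would be read line by line from a file handle instead of a hard-coded list; the counting in the innermost hash also takes care of SIF files in which the same line appears several times.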