Using graphs to relate expression data and protein

Using graphs to relate expression data and
protein-protein interaction data
R. Gentleman and D. Scholtens
April 25, 2017
Introduction
In Ge et al. (2001) the authors consider an interesting question. They assemble gene
expression data from a yeast cell-cycle experiment (Cho et al., 1998), literature proteinprotein interaction (PPI) data and yeast two-hybrid data. We have curated the data
slightly to make it simpler to carry out the analyses reported in Ge et al. (2001). In
particular we reduced the data to the 2885 genes that were common to all experiments.
We have also represented most of the data in terms of graphs (nodes and edges) since
that is the form we will use for most of our analyses.
The results reported in this document are based on version 0.23.0 of the yeastExpData
data package.
0.1
Graphs
The cell-cycle data is stored in ccyclered and from this it is easy to create a cluster graph. A cluster graph is simply a graph that is created from clustered data. In this
graph all genes in the same cluster have edges between them and there are no between
cluster edges.
> clusts = split(ccyclered[["Y.name"]],
+
ccyclered[["Cluster"]])
> cg1 = new("clusterGraph",
+
clusters = lapply(clusts, as.character))
> ccClust = connComp(cg1)
>
Next we are interested in looking for associations between the clusters in this graph
(genes that show similar patterns of expression) and the literature curated set of proteinprotein interactions.
1
We note that these are not really literature purported protein complexes (readers
might want to use the output of a procedure like that in apComplex applied to TAP
data to explore protein complex data). We also note that the data used to create these
graphs is already somewhat out of date and that there is new data available from MIPS.
We have chosen to continue with these data since they align closely with the report in
Ge et al. (2001).
>
>
>
>
data(litG)
ccLit = connectedComp(litG)
cclens = sapply(ccLit, length)
table(cclens)
cclens
1
2
2587
29
>
>
>
3
10
4
7
5
1
6
1
7
2
8
1
12
1
13
1
36
1
88
1
ccL2 = ccLit[cclens>1]
ccl2 = cclens[cclens>1]
We see that there are most of the proteins do not have edges to others and that there
are a few, rather large sets of connected proteins.
We can plot a few of those.
>
>
>
sG1 = subGraph(ccL2[[5]], litG)
sG2 = subGraph(ccL2[[1]], litG)
It is worth noting that the structure in Figures 1 and 2 is suggestive of collections of
proteins that form cohesive subgroups that work to achieve particular objectives.
An open problem is the development of algorithms (and ultimately software and
statistical models) capable of reliably identifying the constituent components. Among
the important aspects of such a decomposition is the notion that some proteins will be
involved in multiple complexes.
1
Testing Associations
The hypothesis presented in ? was to investigate a potential correlation between expression clusters and interaction clusters. The devised a strategy called the transcriptomeinteractome correlation mapping strategy to assess the level of dependence. Here we
reformulate their ideas (a more extensive discussion with some extensions is provided in
Balasubraminian et al. (2004)).
2
YNL289W
YLR313C
YPL031C
YOL039W
YPL140C
YDL179W YDL127W
YEL003W
YHR172W
YLR340W
YDR184C
YML094W
YLL021W
YMR109W
YNL126W
YLR212C
YLR200W
YDR388W
YDR382W
YLR319C
YDR356W
YOR181W YCL040W
YBR109C
YDR085C
YBL016W
YFL039C
YMR092C YAL029C
YKL042W
YOL016C
YCL027W
YPL242C
YDR103W
YNL243W
YHL007C
YER114C
YHR061C
YBR200W
YDR309C
YLR229C
YGR152C
YOR127W
YAL041W
YJL157C
YPL016W
YMR199W
YPR120C
YBR133C
YKL101W
YLR079W
YAL040C
YBR160W
YJL187C
YHR005C
YLL024C
YPR119W
YER111C
YGR108W
YJL194W
YGR109C
YPL256C
YAL005C
YBR155W
YPL240C
YLR216C
YML065W
YOR027W
YMR186W
YOR036W
YHR102W
YOR160W
YNL004W
YDR432W
YDR323C
YLR293C
YOR098C
YGL016W
YPL174C
YOL021C
YER009W
YBL079W
YMR308C
YHR069C
YHR129C
YMR294W
YKL068W
YOR185C
Figure 1: One of the larger PPI graphs
3
YOL139C
YGR162W
YDR180W
YDR386W
YMR117C
YDL043C
YDR485C
YML104C
YMR106C
YDL013W
YDL030W
YMR284W
YDL042C
YNL031C
YDR227W
YOR191W
YBR009C
YLR134W
YNL030W
YBR010W
YIL144W
YDL008W
YER179W
YNL172W
YLR127C
YDR146C
YDR118W
YBL084C
YBR073W
YER095W
YDR004W
YML032C
YNL312W
YDR076W
YAR007C
YJL173C
Figure 2: Another of the larger PPI connected components.
4
The basic idea is to represent the different sets of interactions (relationships) in terms
of graphs. Each graph is defined on the same set of nodes (the genes/proteins that are
common to all experiments) and the specific set of relationships are represented as edges
in the appropriate graph. Above we created one graph where the edges represent known
(literature based) interactions, litG and the second is based on clustered gene expression
data, cg1.
We can now determine how many edges are in common between the two graphs by
simply computing their intersection.
>
>
commonG = intersection(litG, cg1)
commonG
A graphNEL graph with undirected edges
Number of Nodes = 2885
Number of Edges = 42
>
>
And we can see that there are 42 edges in common. Since the literature graph has 315
and the gene expression graph has 156205 we might wonder whether 42 is remarkably
large or remarkably small.
One way to test this assumption is to generate an appropriate null distribution and
to compare the observed value (42) to the values from this distribution. While there
are some reasons to consider a random edge model (as proposed by Erdos and Renyi)
there are more compelling reasons to condition on the structure in the graph and to use
a node label permutation distribution. This is demonstrated below. Note that this does
take quite some time to run though.
>
+
+
+
+
+
+
+
+
+
>
>
>
>
>
ePerm = function (g1, g2, B=500) {
ans = rep(NA, B)
n1 = nodes(g1)
for(i in 1:B) {
nodes(g1) = sample(n1)
ans[i] = numEdges(intersection(g1, g2))
}
return(ans)
}
set.seed(123)
##takes a long time
#nPdist = ePerm(litG, cg1)
data(nPdist)
max(nPdist)
[1] 23
5
2
Data Analysis
Now that we have satisfied our testing curiosity we might want to start to carry out a
little exploratory data analysis. There are clearly some questions that are of interest.
They include the following:
Which of the expression clusters have intersections and with which of the literature
clusters?
Are there expression clusters that have a number of literature cluster edges going
between them (and hence suggesting that the expression clustering was too fine,
or that the genes involved in the literature cluster are not cell-cycle regulated).
Are there known cell-cycle regulated protein complexes and do the genes involved
tend to cluster together?
Is the expression behavior of genes that are involved in multiple literature clusters
(or at least that we suspect of being so involved) different from that of genes that
are involved in only one cluster?
Many of these questions require access to more information. For example, we need
to know more about the pattern of expression related to each of the gene expression
clusters so that we can try to interpret them better. We need to have more information
about the likely protein complexes from the literature data so that we can better identify
reasonably complete protein complexes (and given them, then identify those genes that
are involved in more than one complex). But, the most important fact to notice is
that all of the substantial calculations and computations (given the meta-data) are very
simple to put in graph theoretic terms.
References
R. Balasubraminian, T. LaFramboise, D. Scholtens, and R. Gentleman. Integrating
disparate sources of protein–protein interaction data. Technical report, Bioconductor,
2004.
R.C. Cho et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol.
Cell, 2:65–73, 1998.
H. Ge, Z. Liu, G. M. Church, and M. Vidal. Correlation between transcriptome and
interactome mapping data from saccharomyces cerevisiae. Nature Genetics, 29:482–
486, 2001.
6

Download Report

Using graphs to relate expression data and protein

Paperzz.com

Your Paperzz