The Marginal Benefit of Monitor Placement on Networks
Benjamin Davis, Ralucca Gera∗ , Bing Yong Lim
Department of Applied Mathematics
Naval Postgraduate School, Monterey, CA
∗Corresponding author: [email protected]
Gary Lazzaro
Department of Operations Research
Naval Postgraduate School, Monterey, CA
Erik C. Rye
Department of Computer Science
Naval Postgraduate School, Monterey, CA
December 16, 2015
Abstract
Inferring the structure of an unknown network is a difficult problem of interest to researchers, academics,
and industry. We develop a novel algorithm to infer nodes and edges in an unknown network. Our algorithm
utilizes sensors that can detect adjacent nodes and their labels, as well as the edges incident to the monitor. The algorithm places each new sensor at the highest-degree neighboring node that has been inferred, and it restarts when attempting to place a sensor at a node where a sensor already exists.
The algorithm has an adjustable restarting feature which varies between restarting at a previously discovered
but unmonitored node and a random teleportation to an unexplored node somewhere in the network. We
compare the inference performance of our algorithm against inference through random walks and random
placement in four distinct networks: a Barabási-Albert network, an Erdős-Rényi random network, and two data sets from the Stanford Large Network Dataset Collection [11]. Our algorithm outperforms both random-walk inference and random placement of monitors in edge discovery in all test cases. It also outperforms them in node inference on the synthetic data; on the real data it outperforms them in the beginning of the inference, and we explain why. A website was created where these algorithms can be tested
live on preloaded networks or custom networks as desired by the user. The visualization also displays the
network as it is being inferred, and it provides other statistics about the real and inferred networks.
Keywords: network topology, distance between graphs (similarity), graph comparison metrics, distance in a graph
2000 Math Subject Classification: 05C07, 05C12, 05C50
1 Introduction
The exploration of complex networks is a continuously evolving study as technology progresses and networks
change. In today’s world, there are many networks that are unknown. How do we gain insight into these
unknown networks without having to traverse every vertex and edge within the network? Is there a way to place
monitors at different areas of the network to gain this insight? The objective of this paper is to explore the
topic of monitor placement on network vertices in an attempt to gain insight into the true network topology.
1.1 Motivation
In this paper we assume no knowledge of the true network, except for a rough approximation of the number of
nodes and edges. Therefore, our inference of the networks will begin with choosing a random initial node that
a search algorithm will use to infer the network.
The algorithms used for network inference will be tested on different synthetic and real world complex
networks of the same order. The test networks will be introduced in Table 1. Comparison of performance of an
algorithm amongst these different test networks will be normalized by looking at percentages, that is, the number
of inferred nodes divided by the approximate number of total nodes, or the number of inferred edges divided by
the approximate number of total edges.
In this paper we answer the following questions: As we increase the number of monitors placed up to fifty
percent of the nodes of the true network, what is the percent gain of new information inferred from the original
network? At what percentage of monitor placement does the discovery of inferred network information begin to diminish toward a flat rate of change of nodes discovered per monitor added? What is the minimum
percentage of monitors needed to discover all nodes?
The website http://faculty.nps.edu/rgera/projects.html [8] was created where these algorithms can be tested
live on preloaded networks or custom networks as desired by the user. The visualization also displays the
network as it is being inferred, along with the corresponding percent of edges and nodes inferred, and it provides
other statistics about the real and inferred network. Figure 1 shows two snapshots of the website, displaying
the networks as it is being inferred in green (top left), the leftover part of the network in white (top right), the
plot of edges and nodes inferred (bottom left), and a heat-map of accuracy at each step in the inference (bottom
right). Confidence intervals around the percent edges and nodes can be displayed by using multiple runs.
Figure 1: Two steps in the inference of an Erdős-Rényi network from [8]
1.2 Related Work
Inferring a network’s topology is an essential problem in understanding and studying complex networks, yet
known to be very challenging due to the large number of non-isomorphic graphs per given number of nodes. Our
general theoretical framework is applicable to arbitrary networks, and so in our research we study four different types of networks to get a better feel for the performance of our approach.
Inferring a network can be done with no knowledge of the network at all (other than some random starting
node), or with partial information collected from network devices (such as knowledge of some of the nodes
present in the network), or with complete information (in which case one could use the current knowledge
to further monitor the network, or to re-infer an evolving network). Bliss, Danforth and Dodds [4] present
recent techniques of inferring the topology of complex networks. These techniques are based on sampling nodes,
sampling edges, the exploration of networks using random walks, or snowball sampling based on chain referral
sampling ([3, 10]). Of course, the most relevant question is how the inferred network measures against the true network: random edge selection and depth- and breadth-first search graph traversals do not perform well overall; simple uniform random node selection performs surprisingly well; the best performing methods are the ones based on random walks starting at an arbitrary seed node (with an added probability p at each node to teleport out of the random walk to the seed node or another arbitrary node) [10]. High degree nodes play the important
role of hubs in communication and networking, and different local search strategies in power-law graphs that have costs scaling sublinearly with the size of the graph were introduced in [1]. However, the monitors in
this paper infer more than just the node and edge incident with it, and thus the techniques perform differently.
Most research on the inference of network topology has focused on inferring the Internet. It is impossible to obtain an accurate map of the Internet due to its size and constant evolution. Thus local measurements at a small subset of the nodes are used, and this information is then used to infer the actual graph. A combinatorial optimization formulation of this problem was considered in [2], which provided upper and lower bounds on the competitive ratio and the approximation ratio in several models. Haddadi et al. published a survey of the research on Internet network topology over the past decade, studying techniques for inference, modeling, and generation
of the Internet topology at both the router and administrative levels, and presenting the mathematical models
assigned to various topologies [9] using traceroute. Since traceroutes are not always complete (internal routers
don’t have to cooperate by responding with their IPs), a new and rich research area has developed, namely
network tomography. This assumes access to a few end-nodes in the network and the path of communication
between them in order to study internal characteristics of the Internet [21]. Malekzadeh and MacGregor [12]
present an overview of these techniques, particularly showing the two popular approaches: constructive algorithms that create a binary tree topology by starting with the leaves, and maximum-likelihood approaches based on Bayesian estimators [7] and on Markov Chain Monte Carlo procedures [5, 6].
Other current techniques not necessarily using complex networks are based on differential equations given
one observation of one collective dynamical trajectory [17], statistical dependence between observations [20], as
well as machine learning based on frequency of small subgraphs [13]. Extensions to multilayered networks have
recently been published in [18] and [19].
2 Definitions of related work
We define our monitor to be able to see the node where it is placed, the edges incident to itself and its immediate node neighbours, and to detect the degree of its neighbors as well. We make no further assumptions about knowledge of the network, except for a rough approximation of the number of vertices in the network. These monitors can assign labels to the edges and the nodes that they detect, based on the labels of the true topology inferred. That is, for example, if two monitors i and j each detect node k, they will identify it as the same node k. We introduce this formally below.
Definition 2.1. We say that a monitor on node i detects a node j if (a) d(i, j) ≤ 1, and (b) i knows the label of j and deg(j). A monitor on node i detects an edge ij if i and ij are incident.
Figure 2: A graph and a monitor placed at node v3
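To make the detection model concrete, here is a minimal sketch of Definition 2.1 (our own illustration; the function name monitor_observation and the use of the networkx library are assumptions, not part of the paper's tooling):

```python
import networkx as nx

def monitor_observation(G, i):
    """Sketch of Definition 2.1: a monitor on node i detects its incident
    edges, the labels of its neighbors, and the degree of each neighbor."""
    neighbors = list(G.neighbors(i))
    detected_edges = [(i, j) for j in neighbors]
    neighbor_degrees = {j: G.degree(j) for j in neighbors}
    return neighbors, detected_edges, neighbor_degrees

# Toy example on the path 0-1-2-3-4: a monitor on node 2 detects nodes 1 and 3,
# the edges (2, 1) and (2, 3), and learns deg(1) = deg(3) = 2.
G = nx.path_graph(5)
print(monitor_observation(G, 2))
```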
Notice that a monitor will always detect its closed neighborhood N[v], but it infers more than just its neighbors. This is the idea behind the domination number, γ(G), introduced for graphs by Ore [16]. We say that a vertex dominates itself and its neighbors. Recall that a dominating set is a subset of the nodes such that each vertex of V(G) is dominated by some vertex in the dominating set. The domination number is the cardinality of a minimum dominating set of G. The domination number could certainly be used to monitor a network if the network is known, but it is not useful in our approach of discovering the network, since we do not assume much knowledge of the network.
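As an aside, a dominating set (though not necessarily a minimum one, since computing γ(G) exactly is NP-hard) can be obtained greedily; a small sketch using networkx's built-in heuristic:

```python
import networkx as nx

G = nx.erdos_renyi_graph(100, 0.05, seed=1)
D = nx.dominating_set(G)  # greedy heuristic; |D| need not equal gamma(G)
# Check the defining property: every vertex is in D or adjacent to one in D.
assert all(v in D or any(u in D for u in G.neighbors(v)) for v in G)
print(len(D), "nodes dominate all", G.number_of_nodes(), "nodes")
```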
The k-Vertex Maximum Domination problem, introduced by Miyano and Ono in [14], gives the ideal placement of monitors if complete global network information is available. Given a positive integer k, k-Vertex Maximum Domination (k-MaxVD) finds a subset DN of the nodes with size k that maximizes the number of dominated nodes, that is, maximizes |∪v∈DN N[v]|. Note that this optimization may produce a dominating set for some values of k, but it does not have to, since generally not all the nodes in the network are dominated. In [14], the authors show that a simple greedy strategy achieves an approximation ratio of 1 − 1/e for k-MaxVD, and this approximation ratio is the best possible for k-MaxVD unless P = NP. We thus plot our inference algorithms against a greedy approach as an upper bound, and against random placement and a random walk as lower bounds on the performance of the algorithms. We refer to [15] for additional terminology not included in this paper.
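A minimal sketch of the greedy (1 − 1/e)-approximation for k-MaxVD follows (our own code under the stated assumptions; it requires full knowledge of G, which is exactly why we use it only to produce the Ideal upper-bound curve):

```python
import networkx as nx

def greedy_k_maxvd(G, k):
    """Greedily pick k nodes, each maximizing the number of newly
    dominated nodes in its closed neighborhood N[v]."""
    dominated, chosen = set(), []
    for _ in range(k):
        best = max((v for v in G if v not in chosen),
                   key=lambda v: len(({v} | set(G.neighbors(v))) - dominated))
        chosen.append(best)
        dominated |= {best} | set(G.neighbors(best))
    return chosen, dominated

G = nx.barabasi_albert_graph(200, 3, seed=1)
monitors, dom = greedy_k_maxvd(G, 10)
print(len(dom), "of", G.number_of_nodes(), "nodes dominated")
```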
3 Methodology
For this paper, we assume that the total number of nodes and edges can be reasonably estimated. However,
the configuration and layout of the nodes and edges are unknown until monitors are placed and reveal their 1-hop neighborhoods.
3.1 Placement Methodology
In this section we describe the approach that we use in placing monitors to infer an unknown network. The
formal algorithms can be found in Section 3.3. Our algorithm first picks an initial “seed” node at random. The
monitor is placed at that node. The monitor assigns a label to each node that is its neighbor and each edge
incident to the monitored node. The algorithm then chooses the highest-degree neighbor and places another monitor there. If more than one neighbor attains the highest degree, one of them is chosen at random for the placement of the next monitor.
If the process attempts to place a monitor at a node where a monitor already exists, then a stopping condition
is reached. At the stopping condition, a check is performed for the number of nodes found versus the estimated
number of nodes. Recall that a rough estimate for the size of the network is an assumption in the motivation
of this problem. If the number of nodes found is less than the estimated number of nodes, another random
“seed” node is needed. The next “seed” node could be either a previously unseen node that is known to exist (when p = 0), or a high degree node that was in a previous tie but not selected for monitor placement, or the next highest degree node that was previously skipped (when p = 1), or a combination of both approaches (when 0 < p < 1). If the total number of nodes has already been discovered but the overall budget for monitors has not been reached, the next monitor will be placed at random when p = 0, or according to the passed-over list when p = 1. Each of these approaches for obtaining the next teleportation seed will be investigated, and their performance will be compared (the probability to land on each undiscovered node is the same).
3.2 Theoretical Preliminaries
We present an initial bound on the number of monitors needed for network inference based on our algorithm.
The best case is if the actual network topology is a star, which gives a lower bound of 1 since either the first
or second monitor would be placed at the center node. The center node, all edges of the star, and all leaf nodes would be identified and labeled. The worst case is if the actual topology is a path, and the worst-case scenario for a path is if the first monitor is placed at a leaf node. In that case, it would take n − 1 monitors to actually
discover all n nodes of the graph. We thus have the following remark, and the bounds are sharp given by the
star and path described above. Therefore for any graph G on n vertices,
1 ≤ num monitors ≤ n − 1.
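These bounds can be checked with a small simulation (a sketch under our assumptions; it always moves to the highest-degree unmonitored neighbor and ignores restarts, which suffices on a star or a path):

```python
import networkx as nx

def monitors_until_fully_inferred(G, start):
    """Count the monitors placed before every node and edge of G is
    inferred, always moving to the highest-degree unmonitored neighbor."""
    nodes, edges, monitored, current = set(), set(), set(), start
    while True:
        monitored.add(current)
        nodes |= {current} | set(G.neighbors(current))
        edges |= {frozenset(e) for e in G.edges(current)}
        if len(nodes) == G.number_of_nodes() and len(edges) == G.number_of_edges():
            return len(monitored)
        unvisited = [v for v in G.neighbors(current) if v not in monitored]
        current = max(unvisited, key=G.degree)  # never empty on a star or path

n = 50
print(monitors_until_fully_inferred(nx.star_graph(n - 1), 1))  # 2: a leaf, then the center
print(monitors_until_fully_inferred(nx.path_graph(n), 0))      # n - 1, starting at a leaf
```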
The biggest obstacle to efficiency associated with this algorithm is the potential need for many monitors
due to these bounds. An efficient design would be to completely infer a network with fewer monitors than the
sharp upper bound. However, in reality no more than 50% is needed, and so we will use an upper bound of n/2
vertices with monitors for a maximum amount of network discovery. This upper bound of 50% on the number of
monitors available can be thought of as a budgetary constraint on the problem in terms of a limit on how much
of the network can be subject to monitoring.
3.3 Algorithm and Constraints
We create one hill-climbing algorithm starting at some random node of the network being investigated, with
a probabilistic restart. We consider the two extremes, when p = 0 each time and when p = 1 each time. We
further consider the variable p algorithm, and compare the three versions to a best possible approach based on complete
information, as well as a random placement and random walk.
Given a constraint upon the number of monitors able to be placed in the graph, num monitors, we describe
our algorithm for graph inference below. Let N (v) be the open neighborhood of a vertex v ∈ V (G).
When Algorithm p = 0 terminates and the budget for total available monitors has not been reached, the algorithm is restarted by teleporting to a randomly selected node of the network under investigation that is not part of the inferred graph, where the next monitor is placed (the probability to land on each undiscovered node is the same). In Algorithm p = 0, ties for the highest degree neighbor are broken by randomly picking one of all highest degree neighbors for the placement of the next monitor. This mimics, for example, surfing the web: after going down a path and deciding that enough information was discovered along that idea, a restart on a new idea is needed.
Algorithm p = 1 is a refined implementation that avoids the use of teleportation for restarting the algorithm
whenever possible. Algorithm p = 1 attempts to restart itself by returning to a previously discovered vertex of
highest degree that was not chosen for monitor placement previously. If no such vertex exists, then Algorithm p =
1 must be given a previously unknown vertex to start again. Notice that in a disconnected graph, Algorithm p = 1
will discover fewer components as it restarts with previously discovered vertices in the same component.
Algorithm with variable p can be viewed as a combination of the first two approaches in terms of how
to restart the search algorithm when it terminates. Algorithm variable p exists to take advantage of both
approaches, teleportation and restarting from the largest-degree node not yet used as a monitor, as well as all the
intermediate choices.
Algorithm 1 HillClimb: High-degree neighbor with restart by teleportation or large seen degree
monitor ← node chosen at random from the entire network
seen_nodes_list ← ∅
inferred_graph ← ∅
while fewer than 50% of the nodes in the network have a monitor placed on them do
    add monitor to seen_nodes_list
    add all edges and nodes adjacent to the monitor to inferred_graph
    add neighbors of the monitor that have not yet been discovered to seen_nodes_list
    highest_deg_node ← neighbor of monitor with highest degree (ties broken at random)
    if highest_deg_node does not have a monitor then
        monitor ← highest_deg_node
    else restart according to the probability p:
        if p = 0 then
            monitor ← node chosen at random from the complement of inferred_graph
        else if p = 1 then
            monitor ← unmonitored node with max degree in seen_nodes_list
        else
            with probability p, monitor ← unmonitored node with max degree in seen_nodes_list;
            otherwise, monitor ← node chosen at random from the complement of inferred_graph
        end if
    end if
end while
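For concreteness, a compact Python rendering of Algorithm 1 follows (a sketch, not the exact implementation used for the experiments; the names hill_climb and budget_frac, the seed handling, and the use of networkx are our assumptions, and the true graph G stands in for the hidden network that is queried only one hop at a time):

```python
import random
import networkx as nx

def hill_climb(G, p, budget_frac=0.5, seed=None):
    """Sketch of Algorithm 1: walk to highest-degree neighbors; on a
    stopping condition, restart from the highest-degree seen node
    (probability p) or teleport to a random undiscovered node (1 - p)."""
    rng = random.Random(seed)
    inferred = nx.Graph()
    monitored, seen = set(), set()
    budget = int(budget_frac * G.number_of_nodes())
    monitor = rng.choice(list(G.nodes()))
    while len(monitored) < budget:
        monitored.add(monitor)
        seen.add(monitor)
        inferred.add_node(monitor)
        for nbr in G.neighbors(monitor):           # Definition 2.1: 1-hop view
            inferred.add_edge(monitor, nbr)
            seen.add(nbr)
        nbrs = list(G.neighbors(monitor))
        if nbrs:
            top = max(G.degree(v) for v in nbrs)
            nxt = rng.choice([v for v in nbrs if G.degree(v) == top])  # random tie-break
        else:
            nxt = None                              # isolated node
        if nxt is not None and nxt not in monitored:
            monitor = nxt
        else:                                       # stopping condition: restart
            unseen = [v for v in G.nodes() if v not in seen]
            passed_over = [v for v in seen if v not in monitored]
            if not unseen and not passed_over:
                break                               # nothing left to monitor
            if passed_over and rng.random() < p:
                monitor = max(passed_over, key=G.degree)  # largest seen degree
            elif unseen:
                monitor = rng.choice(unseen)        # uniform teleportation
            else:
                monitor = rng.choice(passed_over)   # all seen; budget remains
    return inferred

# Usage sketch on an ER graph comparable to those of Section 4:
G = nx.gnm_random_graph(5242, 14496, seed=7)
H = hill_climb(G, p=0.5, seed=7)
print(f"nodes: {H.number_of_nodes() / G.number_of_nodes():.1%}, "
      f"edges: {H.number_of_edges() / G.number_of_edges():.1%}")
```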
4 Results and Discussion
Table 1 presents general information regarding the four data sets that we will be using in more detail in this
paper: One Erdős-Rényi (ER) network, one Barabási-Albert (BA) network, one Facebook (FB) network and
one General Relativity collaboration (GR) network. The two real networks were obtained from the Stanford Large Network Dataset Collection [11], and the two synthetic ones were created to be comparable. The node
count, edge count and number of components for varying values of p correspond to the inferred network at our
maximum budget for monitors of 50% of the total nodes of the true network.
Metrics                        GR       ER       BA       FB
Node Count (True Network)      5242     5242     5242     4039
Edge Count (True Network)      14496    14496    15717    88234
Components (True Network)      355      19       1        1
Node Count (p = 0)             4793     5095     5199     4039
Edge Count (p = 0)             11942    11804    13734    67864
Components (p = 0)             252      7        1        1
Node Count (p = 1)             4070     5084     5241     4039
Edge Count (p = 1)             13192    12455    15332    82927
Components (p = 1)             2        1        1        1
Node Count (Variable p)        4640     5120     5232     4039
Edge Count (Variable p)        13427    12719    14759    79823
Components (Variable p)        190      1        1        1

Table 1: Overview of the data
Each algorithm was performed with 50 trials with different starting seed nodes. The performance of each
algorithm is shown and discussed for the average of the 50 trials. In particular, we explore the algorithm’s
limits of p = 0 (each time the algorithm wants to assign a monitor to a node that already has one, it teleports to a random node unoccupied by a monitor) and p = 1 (each time the algorithm wants to assign a monitor to a node that already has one, it restarts at a previously seen node of largest degree that is
unoccupied by a monitor). We then explore variable values of p, namely p = .25, p = .5 and p = .75. We plot our
inference algorithms against the greedy approach with complete information as an upper bound based on [14]
(called Ideal shown in black), and two lower bounds shown in different shades of freesia representing Random
Placement (RP) and Random Walk (RW). The monitor type for all choices of Ideal, RW and RP is the same as we introduced for our research.
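The averaging itself is straightforward; a short harness (a sketch that assumes the hypothetical hill_climb function from the Section 3.3 sketch is in scope):

```python
import statistics

def average_inference(G, p, trials=50):
    """Average the fraction of nodes and edges inferred over repeated
    trials, each started from a different random seed."""
    node_fracs, edge_fracs = [], []
    for t in range(trials):
        H = hill_climb(G, p, seed=t)   # hill_climb: the Section 3.3 sketch
        node_fracs.append(H.number_of_nodes() / G.number_of_nodes())
        edge_fracs.append(H.number_of_edges() / G.number_of_edges())
    return statistics.mean(node_fracs), statistics.mean(edge_fracs)
```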
4.1 General Relativity Network
The General Relativity network comprises 5242 nodes with 14496 edges. It is a professional collaboration
network where the nodes represent the authors and the edges represent authors who have published scholarly
articles about General Relativity together. In Figure 3 we compare our algorithms and follow it with a discussion.
Figure 3: General Relativity Network: Percent of nodes & edges in the inferred graph
In the beginning of the inference our algorithms outperform the random placement and random walks as they
find their way to the hubs by placing monitors on high degree nodes, which is the realistic use of monitors. However,
after a certain point, most of our algorithms are actually not optimal methods to infer the true topology of a
disconnected network as they are being bypassed by just a random placement of monitors that jumps around
from component to component. This shows that in the case of multiple components (355 in our case), one can
mostly rely on our inferences in the first quarter or so of the process (this will improve for connected networks
as we present later). As seen in Figure 3 for both edges and nodes inference, there is a higher rate of increase
of information discovered within roughly the first 10% of monitor placement. Furthermore, the rate of percent
nodes discovered begins to decrease above 35% toward a constant rate of change, as the inference tries to exhaustively discover each component before discovering a new one. This is important to highlight because it suggests that it will take much more than the already used 50% of the nodes to infer the remaining 10% of the true network nodes with this monitor placement algorithm; this is information needed for decisions on monitor
placements. Moreover, within the percent of edges discovered plot, the rate of increase for edges discovered
begins to move towards a more linear trend at around 15% of node monitor placement. Just as in the percent
of nodes discovered graph, the monitor placement will take more than 50% of the nodes, arguably the majority of the nodes, in order to discover the remaining 20%.
Among all the introduced algorithms, we get the best performance for disconnected networks when p = 0
as it has the most randomness, allowing for the discovery of more components. In this case, using the average
of fifty trials the algorithm placed monitors on 2650 nodes (roughly 50% of the entire network), showing the
discovery of 4793 nodes (90.96% of the nodes) and 11942 edges (81.97% of the edges) when compared to the
known ground truth topology of the sample General Relativity network. In both plots, the percentage of inferred
information is significant at this point, but it takes the placement of monitors at half of the nodes due to the
large number of components and absence of hubs.
At the other extreme, the graph inference trial using p = 1 generated a graph of 4070 nodes and 13192 edges.
As a percentage from the average of fifty trials, this algorithm captures 77.64% of the nodes and 91.00% of the
edges when compared to the known ground truth topology of the sample General Relativity network. When
p = 1, the discovery method stays within the larger components, finding more edges instead of finding all the leaf nodes and small components. This statement is further supported by the fact that using p = 0 roughly finds 252 of the 355 components, while when p = 1 we only see the 2 largest components, but in more detail.
Based on this information, the algorithm with p = 1 will take many more monitors than 50% of the nodes,
arguably all of them, to find the remaining 23% of nodes and 9% of the edges.
At p = 0.25, the algorithm inferred 89.86% of the nodes, 90.55% of the edges, and roughly 228 components.
This makes sense since when a stopping criterion is reached, the algorithm selects a random node 25% of the time,
while 75% of the time the algorithm will select the last highest degree node passed over. This ensures that some
of the pendant components are not left out of the inferred topology as seen with 228 components discovered.
However, the number of components inferred continues to decrease as p increases. At p = 0.5, the algorithm
inferred 88.52% of the nodes, 92.63% of the edges, and found 190 components. As p increases, the number of
inferred nodes and components will decrease. At p = 0.75, the algorithm inferred 85.62% of the nodes, 93.02%
of the edges, and found 126 components. The beauty of using the algorithm with variable p is that it allows the user to dictate what topology they want to extract from a given network. If the customer wants to know about the information highways (the edges), then using a higher value for p will produce the best
results. Additionally, if the customer wants to find information about hubs and communities, then a small value
of p will yield the needed information.
We now present a synthetic model with fewer components, moving on to connected graphs after that.
4.2 Erdős-Rényi Random Graphs
The particular Erdős-Rényi random graph was generated for the purpose of the simulation to match the node and edge cardinality of the General Relativity graph. The created graph comprises only 19 components. This potentially means that fewer monitors are needed compared to the General Relativity network, since there are fewer components to be inferred. Figure 4 shows the results of all our algorithms compared among each other.
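Such a graph can be generated with the G(n, m) variant of the Erdős-Rényi model (a sketch; the component count varies with the seed, and the instance used in this paper happened to have 19):

```python
import networkx as nx

# Match the GR graph's node and edge counts exactly with the G(n, m) model.
ER = nx.gnm_random_graph(5242, 14496, seed=1)
print(ER.number_of_edges())                 # 14496 by construction
print(nx.number_connected_components(ER))   # seed-dependent; 19 in our instance
```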
In the case of p = 0 due to the random restarts and disconnected components in the original graph, there
needed to be a large number of monitors to capture as many of the initial nodes as possible. The number of
connected components found with this algorithm is 7 of the existing 12. Thus the remaining 5 components comprise at most the remaining 2.8% of the nodes, so they are small components.
Since our budget was 50% of the nodes with monitors, placing all of them, i.e. placing 2650 monitors, will gather information on 97.2% (5095) of the initial nodes and 81.4% (11804) of the edges. We found that placing monitors on 20% of
the network’s nodes would allow 75% of the nodes to be discovered. This constitutes the steep part of the curve,
Figure 4: ER Random Networks: Percent of Nodes and Edges in Inferred Graph
after which the curve tends to taper off to a plateau, and increasing the number of monitors does not provide the same additional benefit as placing a new one initially. However, the additional benefit of adding more monitors is that it allows the algorithm to continue finding information on more edges, in addition to finding
more nodes. This is because, at each given step of the algorithm the next monitor is picked as a highest degree
neighbor of the current monitor, and hence the nodes with high degrees were picked first for monitor placement.
While the algorithm results show that placing monitors at nearly half of the nodes could still not
capture 100% of the network, this could be attributed to the properties of random graphs (where nodes could
potentially tend to be scattered throughout the components without the existence of hubs). This reinforces
the observation that some of the nodes (particularly those residing in the small disconnected components) tend
to be neglected. Therefore, from a monitoring perspective, it could be feasible not to place monitors on these
remaining components and nodes as they are likely not carrying enough information to justify the need to place
a monitor there (since they are not that highly connected to other neighbours in the first place).
While there could be a budget constraint on the number of monitors to be placed on a network, one must
weigh the economic benefits of placing more monitors onto the network in order to capture the remaining minority of the nodes. When p = 0, the algorithm caps the maximum number of monitors at half of the nodes in order to reap the benefits of economies of scale. Placing monitors on the initial 20% could
capture 80% of the nodes, and that represents a fairly good investment on the number of monitors placed.
When p = 1, the number of nodes and edges found is about the same, with the red plot now even showing. However, a deeper analysis shows that when p = 1 we could find most of the nodes in a single large component, without
branching off to other disconnected components, which validates our intuition, since the restarts happen from a previously seen node of large degree. Nodes of higher average degree were also found by Algorithm p = 1. A quick comparison showed that 20% of the monitors when p = 0 found slightly less than 80% of the nodes, whilst 20% of the monitors found more than 80% of the nodes when p = 1. The rate of discovery is faster when p = 1.
Both results present an interesting option for the user, depending on what is desired: using p = 1 is more tailored for edge discovery and in-depth component discovery, whilst p = 0 provides a
methodology to find all components more effectively.
With both extremities explored, the next step is to explore the in-between scenarios for different probabilities of
the algorithm. We explored three scenarios of different probabilities: p = 0.25, 0.5, 0.75. The results showed that
for variable p, the algorithm presented a slight increase in number of nodes and edges regardless of the value of
p. At 50% nodes with monitors and a variable p, the algorithm found 97.7% of the total nodes and 87.5% of
the total edges. The results showed that only one connected component is found, signifying that the majority
of the nodes and edges were contained in this big component.
4.3 Preferential Attachment Model: Barabási-Albert Networks
Our graph inference trial involving this sample Barabási-Albert model graph was performed on a network
consisting of 5242 nodes and 15717 edges, matching the number of nodes in the General Relativity example.
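The edge count falls out of the preferential-attachment construction with m = 3 edges per arriving node, since m(n − m) = 3 · 5239 = 15717; a sketch of the generation step (our parameter choice):

```python
import networkx as nx

# n = 5242 nodes, m = 3 edges per new node gives 3 * (5242 - 3) = 15717 edges.
BA = nx.barabasi_albert_graph(5242, 3, seed=1)
print(BA.number_of_nodes(), BA.number_of_edges())   # 5242 15717
print(nx.is_connected(BA))                          # True: grown as one component
```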
Figure 5: BA Network: Percent of nodes and edges in the inferred graph
By construction, this network is connected, and it has hubs, unlike the General Relativity and the Erdős-Rényi random graphs. Due to the propensity of high degree hubs to form in the Barabási-Albert network construction model, our algorithm captures a large percentage of the nodes in the ground truth topology using relatively few monitors, regardless of the choice of p values. This is evident in Figure 5, in which the inference results for different values of p overlap the whole time, and they are very far away from the random placement and random walk inferences. It is also quickly noticeable that the inference is nearly optimal, being so close to the Ideal.
This effect is due to the hub nodes having high closeness centrality relative to the nodes that lie on the
periphery of the network. That is, regardless of the initial randomly-selected start node, a node of high degree
will be within a couple of steps from the seed, and it will be selected for monitor placement early in the
algorithm’s execution. Further, we can see a diminishing return on investment as the number of monitors placed
in the original graph increases. We see a similar effect with the edges discovered in the inferred graph, just
slower. This is due to the fact that many edges get discovered when monitors are placed on high degree nodes,
but since the hubs are close to each other, the marginal benefit for edges decreases quickly after a couple of hubs
are used.
As we increase the number of monitors placed from zero up to 50% of the nodes in the true network, the
percent gain of new information gained per new monitor added quickly tends toward zero. In terms of nodes
discovered, the derivative of the function given by our curve in Figure 5 decreases from a maximum of about .5% marginal gain at the first fifty-monitor step to about .1% marginal gain in nodes discovered when 20% of the
nodes in the graph are monitors.
4.4 Facebook Network
Recall that the Facebook network used is very similar in size to the General Relativity network from dataset [11]; namely, it is a connected graph of 4039 nodes and 88234 edges. This network was chosen because it is similar in size and connected with hubs, bringing a little extra information to the previous three networks. The general
overview is displayed in Figure 6.
In terms of edges, very little variation exists between the algorithms, and our algorithms clearly outperform the random walk and random placement ones. When we consider the nodes, however, we have different plots. Much like the Barabási-Albert, the hubs help the inference very quickly at first. However, the existence of homophily and communities levels off the inference into step increases, since the algorithm places monitors on nodes that are high degree neighbors of each other. In particular, out of all the choices of p, the value of p = 0 performs best due to the teleportation. Randomness allows the random placement inference to outperform our inference in the discovery of nodes after a certain point, but to get the complete picture of the network to be discovered, edges need to be taken into account, where our algorithms outperform randomness. This may be specific to this network, thus we suggest extending the analysis to other real networks with homophily.

Figure 6: FB Network: Percent of nodes and edges in the inferred graph
To see how randomness helps, consider only the vertices of degree greater than 250. Figure 7(a) shows the
entire network, and the plot of Figure 7(b) shows the graph induced on the vertices with degree greater than 250
which reveals 3 components. Consider these 3 components as communities for the entire Facebook network. One cluster is built around a hub that is by itself. The second cluster is built around a triangle of hubs. The third
cluster is centered around a star of hubs. Recall that since the entire Facebook graph is connected, we know
that these clusters must be connected to each other through lower degree vertices. Thus, interconnecting paths
between the clusters of the Facebook Network must contain an edge that is incident to vertices that have a low
degree (since these clusters are not connected among the hubs themselves). On the Facebook network, we discovered a vertex of degree 66 connecting different parts of the network. Thus, we discover the nodes as a step function since many high degree
nodes need to be used as monitors before getting to the lower degree node connecting the clusters.
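The hub structure described above can be reproduced in a few lines (a sketch; the edge-list file name for the SNAP ego-Facebook data [11] is an assumption about the download, not something fixed by this paper):

```python
import networkx as nx

# facebook_combined.txt: combined ego-network edge list from SNAP [11] (assumed path).
G = nx.read_edgelist("facebook_combined.txt", nodetype=int)
hubs = [v for v, d in G.degree() if d > 250]
H = G.subgraph(hubs)                        # graph induced on the hub vertices
print(nx.number_connected_components(H))    # the paper reports 3 components
```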
Figure 7: Facebook Network Visualizations: the overall network and the vertices with degree above 250
5 Conclusion
In this research we introduced a hill-climb algorithm that infers a network with no knowledge of the network
other than random nodes to start (or restart) the algorithm. Our algorithm has a probabilistic restart once
it wants to place a monitor on a node that is occupied by a monitor: when p = 1 the algorithm restarts at a
large degree node that has been discovered, versus when p = 0 the algorithm restarts at a random node of the
network, and there are all the choices in between for the variable 0 ≤ p ≤ 1 as expected. The value of p is chosen
before the algorithm starts.
We analyzed real-world networks (one social network and one co-authorship network) and comparable synthetic networks (Erdős-Rényi and Barabási-Albert networks). Our analysis of 50 runs of the algorithm for
each value of p on each of the four networks concludes that there is very little difference between the variants of
p in the algorithm when we are concerned with edges being inferred. However, if the inference of nodes is the
main goal, it is interesting to see the clear difference between the real networks and the synthetic networks, as
well as the disconnected versus connected ones. On the synthetic networks, there is no difference between any of
our algorithms and they outperform the random placement and random walk, being extremely close to the ideal
case in the presence of hubs. On the real networks, we see a variance between the algorithms, our inferences
outperforming the random walks and placement towards the beginning of the inference, at which point random
placement performs better since it doesn’t choose nodes in the same clusters or component as our algorithms
do. As the analysis shows, the fewer components we have, the better our algorithms perform. This suggests that the
current algorithms should be used for quick inferences with a few monitors on connected networks, and they can
be improved by adding restarting conditions or switching the algorithms after a certain number of monitors has
been placed. Also, on the real networks, we observed that if there are no teleportations, our algorithm infers
the denser part of the graph in more detail. Then the algorithm infers the hubs and the edges between hubs
ignoring the small degree vertices, particularly the ones that are not in the neighborhood of the hubs. If the
graph is disconnected with many components, the inference of the largest component is captured in detail, while
the other small components are not discovered at all.
A user that desires to infer an unknown complex network with this algorithm needs to have two estimates
about the unknown network to get the best performance. First, a user needs to know a rough estimate of the
size of the network. This estimate helps to define the budget of total monitors, which was set to n/2 in this
research. Second, the user needs to have a goal of inferring nodes or edges. If the inferred nodes are the goal,
then select a probability close to p = 0 to add randomness, and for edges select p = 1. The variable
parameter p of the algorithm combines the two different kinds of search methodologies, namely edge-finding or
node-finding, allowing a better discovery of both nodes and edges on average.
6 Further Studies
We present a few directions for further research on this topic.
Open question 1: One possible extension for the detection algorithm would be to increase the capability of
the detection sensor. A limitation of the sensor examined in this paper is that it can only detect incident edges, neighboring vertices, and the degrees of those neighbors. Future work could consider a sensor that has the
capability to detect a triangle or nodes at further steps from the monitors.
Open question 2: Another possible improvement is to combine algorithms after a certain number of steps,
or to add restarts more often. This will avoid the step increases observed in the Facebook network due to the
clusters of hubs.
Open question 3: The biggest improvement that the authors see is finding a way of comparing the topology
of inferred networks to the true network, using other metrics besides the percent nodes and percent edges
discovered.
Acknowledgment
The authors would like to thank the DoD for partially sponsoring the current research. We would also like
to thank and acknowledge the Naval Postgraduate School’s Center for Educational Design, Development, and
Distribution (CED3) for creating the live visualization app [8] for this project.
References
[1] Lada A Adamic, Rajan M Lukose, Amit R Puniyani, and Bernardo A Huberman. Search in power-law
networks. Physical review E, 64(4):046135, 2001.
[2] Zuzana Beerliova, Felix Eberhard, Thomas Erlebach, Alexander Hall, Michael Hoffmann, Matúš Mihalák,
and L Shankar Ram. Network discovery and verification. Selected Areas in Communications, IEEE Journal
on, 24(12):2168–2181, 2006.
[3] Patrick Biernacki and Dan Waldorf. Snowball sampling: Problems and techniques of chain referral sampling.
Sociological methods & research, 10(2):141–163, 1981.
[4] Catherine A Bliss, Christopher M Danforth, and Peter Sheridan Dodds. Estimation of global network
statistics from incomplete data. PloS one, 9(10):e108471, 2014.
[5] Mark Coates, Rui Castro, Robert Nowak, Manik Gadhiok, Ryan King, and Yolanda Tsang. Maximum
likelihood network topology identification from edge-based unicast measurements. In ACM SIGMETRICS
Performance Evaluation Review, volume 30, pages 11–20. ACM, 2002.
[6] Forrest W Crawford, Daniel Zelterman, Willem H Mulder, Marc A Suchard, Peter M Aronow, Alexander
Coppock, Donald P Green, Robert E Weiss, and Vladimir N Minin. The graphical structure of respondent-driven sampling. arXiv preprint arXiv:1406.0721, 2014.
[7] Nick G Duffield, Joseph Horowitz, F Lo Presti, and Don Towsley. Multicast topology inference from
measured end-to-end loss. Information Theory, IEEE Transactions on, 48(1):26–45, 2002.
[8] Ralucca Gera. Network Discovery Visualization: An analysis of network discovery. http://faculty.nps.edu/rgera/projects, 2015.

[9] Hamed Haddadi, Miguel Rio, Gianluca Iannaccone, Andrew Moore, and Richard Mortier. Network topologies: inference, modeling, and generation. Communications Surveys & Tutorials, IEEE, 10(2):48–69, 2008.

[10] Jure Leskovec and Christos Faloutsos. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 631–636. ACM, 2006.

[11] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
[12] Amir Malekzadeh and Mike H MacGregor. Network topology inference from end-to-end unicast measurements. In Advanced Information Networking and Applications Workshops (WAINA), 2013 27th
International Conference on, pages 1101–1106. IEEE, 2013.
[13] Manuel Middendorf, Etay Ziv, and Chris H Wiggins. Inferring network mechanisms: the Drosophila
melanogaster protein interaction network. Proceedings of the National Academy of Sciences of the United
States of America, 102(9):3192–3197, 2005.
[14] Eiji Miyano and Hirotaka Ono. Maximum domination problem. In Proceedings of the Seventeenth
Computing: The Australasian Theory Symposium-Volume 119, pages 55–62. Australian Computer Society,
Inc., 2011.
[15] Mark Newman. Networks: An Introduction. Oxford University Press, Inc., New York, NY, USA, 2010.
[16] O Ore. Theory of graphs. Am. Math. Soc. Colloq. Publ., 38, 1962.
[17] Srinivas Gorur Shandilya and Marc Timme. Inferring network topology from complex dynamics. New
Journal of Physics, 13(1):013004, 2011.
[18] Rajesh Sharma, Matteo Magnani, and Danilo Montesi. Missing data in multiplex networks: a preliminary
study. Third International Workshop on Complex Networks and their Applications, 2014.
[19] Rajesh Sharma, Matteo Magnani, and Danilo Montesi. Investigating the types and effects of missing data
in multilayer networks. IEEE/ACM International Conference on Advances in Social Networks Analysis and
Mining, 2015.
[20] Kinh Tieu, Gerald Dalley, and W Eric L Grimson. Inference of non-overlapping camera network topology
by measuring statistical dependence. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International
Conference on, volume 2, pages 1842–1849. IEEE, 2005.
[21] Yehuda Vardi. Network tomography: Estimating source-destination traffic intensities from link data.
Journal of the American Statistical Association, 91(433):365–377, 1996.