Hubs and Key Players

Network Mapping of
Large Data Sets
Al Ozonoff, Ph.D.
Joel Bernanke, M.Sc.
Boston University
School of Public Health
Network Analyses of Linked Data Sets
─ Yook (2002) developed network generators that captured
the Internet’s topology; postulated preferential
attachment and linear distance dependence.
Yook, S.-H., Jeong, H., & Barabasi, A.-L. 2002. Modeling the Internet’s large-scale topology. PNAS, 99, 13382-13386.
─ Schwikowski (2000) built a protein-protein interaction
network in yeast to predict protein function.
Schwikowski, B., Uetz, P., & Fields, S. 2000. A network of protein-protein
interactions in yeast. Nature Biotechnology, 18(12), 1257-1261.
May 22, 2008
Interface - RISK : Reality
Networks in Public Health
─ Jones (2003) reported on power-law scaling in sexual
contact networks, relating the scaling coefficient to the
rate of disease transmission and the threat of epidemic.
Jones, J. H., & Handcock, M. S. 2003. An assessment of preferential
attachment as a mechanism for human sexual network formation. Proc. R.
Soc. Lond. B, 270, 1123-1128.
─ De (2004) used network centrality measures to identify
key individuals in a gonorrhea outbreak.
De, P., Singh, A. E., Wong, T., Yacoub, W. & Jolly, A. M. 2004. Sexual network
analysis of gonorrhea outbreak. Sex Transm Infect, 80, 280-285.
Natural Mapping of a Data Set
When linkages are not predefined, suitable criteria for
identifying linkages must be developed.
We propose a natural mapping of a data set onto a
network: variables map to nodes and the associations
among variables map to edges.
The NHANES Data Set
The National Health and Nutrition Examination Survey
(NHANES) assesses the health and nutritional status of
adults and children in the United States through
interviews and physical examinations.
The NHANES data set includes:
─ Demographics
─ Laboratory test results
─ Dietary records
─ Physiological measurements
─ General health information
Selecting Data to Map
A subset of continuous measures from all four of
the NHANES modules was included in the analysis.
Continuous measures with small numbers of observations
(< 20) were excluded.
Examples:
─ Age (years)
─ Blood titers
─ Number of green vegetables eaten per month
─ Cardiovascular stress test measurements
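The exclusion rule can be sketched in a few lines. This is only an illustration, assuming each variable is held as a list with None for missing values; the variable names and the helper `keep_well_observed` are hypothetical, and the actual filtering was done in SAS.

```python
MIN_OBS = 20  # measures with fewer observations than this are excluded

def keep_well_observed(data, min_obs=MIN_OBS):
    """Return only the variables with at least `min_obs` non-missing values."""
    return {
        name: values
        for name, values in data.items()
        if sum(v is not None for v in values) >= min_obs
    }

# Hypothetical toy data, not NHANES values.
data = {
    "age_years": list(range(30)),        # 30 observations: kept
    "rare_titer": [1.2, None, 3.4] * 5,  # 10 observations: dropped
}
kept = keep_well_observed(data)
print(sorted(kept))
```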
May 22, 2008
Interface - RISK : Reality
6
Generating a Correlation Matrix
We generated a correlation matrix that includes the
Spearman correlation between every variable and every
other variable.
All the correlations were converted to their absolute values.
We included correlations in the matrix regardless of their
significance.
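A minimal sketch of this step, assuming complete data (no missing values) and no constant columns. It computes Spearman's coefficient as the Pearson correlation of average ranks; the function names are illustrative, and the actual matrix was produced in SAS.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def abs_spearman_matrix(columns):
    """columns: dict of name -> list of values. Returns nested dict of |rho|."""
    ranked = {name: average_ranks(vals) for name, vals in columns.items()}
    names = list(columns)
    return {a: {b: abs(pearson(ranked[a], ranked[b])) for b in names}
            for a in names}

# Hypothetical toy columns, not NHANES values.
columns = {
    "age": [20, 30, 40, 50, 60],
    "bmi": [22, 25, 24, 30, 31],
    "falling": [5, 4, 3, 2, 1],
}
matrix = abs_spearman_matrix(columns)
```

Taking the absolute value means that a perfectly decreasing relationship (here "age" against "falling") is treated as strongly as a perfectly increasing one.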
Mapping the NHANES Data Set
Variables were mapped to nodes.
Spearman correlations among the
variables were mapped to edges.
The exact correlation was either
retained as a measure of the strength
of an association or was
dichotomized (0, 1) based on a cutoff.
[Figure: nodes Age (years) and Body Mass Index, shown once with edge weight 0.6 and once dichotomized with Cutoff = 0.7]
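The two edge encodings can be sketched as follows. This is an illustration only: the helper `to_edges` is hypothetical, and treating a correlation exactly equal to the cutoff as an edge (`>=`) is an assumption, since the slides do not say how ties with the cutoff were handled.

```python
def to_edges(corr, cutoff=None):
    """Turn a symmetric matrix of absolute correlations into an edge list.

    cutoff=None keeps every pair, with the correlation as the edge weight.
    Otherwise a pair becomes an edge of weight 1 when its correlation
    meets the cutoff and is left out (weight 0) when it does not.
    """
    names = sorted(corr)
    edges = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = corr[a][b]
            if cutoff is None:
                edges[(a, b)] = r
            elif r >= cutoff:
                edges[(a, b)] = 1
    return edges

corr = {
    "Age (years)": {"Age (years)": 1.0, "Body Mass Index": 0.6},
    "Body Mass Index": {"Age (years)": 0.6, "Body Mass Index": 1.0},
}
weighted = to_edges(corr)            # one edge carrying weight 0.6
binary = to_edges(corr, cutoff=0.7)  # no edge: 0.6 is below the cutoff
```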
Software
SAS 9.1 – Integrate NHANES data modules and generate
correlation matrix.
UCINET – Convert correlation data to network data.
Netdraw – Visualize and analyze network data.
KeyPlayer – Identify key players.
Networks by Cutoff
[Figure: network diagrams at Cutoff = 0.2, Cutoff = 0.5, and Cutoff = 0.8]
Distribution of Connections by Cutoff
[Figure: distribution of connections per node at Cutoff = 0.2, Cutoff = 0.5, and Cutoff = 0.8]
Degrees and Unlinked Nodes
[Figure: mean number of connections per node (degree) plotted against cutoff]
[Figure: percentage of unlinked nodes (isolates) plotted against cutoff]
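Both summaries can be computed directly from the absolute correlation matrix. The sketch below uses a made-up four-variable matrix, not NHANES results, and assumes an edge exists when the correlation is at or above the cutoff.

```python
def degree_summary(corr, cutoff):
    """Return (mean degree, percentage of isolates) for one cutoff."""
    names = list(corr)
    degrees = [
        sum(1 for b in names if b != a and corr[a][b] >= cutoff)
        for a in names
    ]
    mean_degree = sum(degrees) / len(degrees)
    pct_isolates = 100 * sum(d == 0 for d in degrees) / len(degrees)
    return mean_degree, pct_isolates

# Hypothetical absolute correlations among four made-up variables.
pairs = {("a", "b"): 0.9, ("a", "c"): 0.55, ("a", "d"): 0.3,
         ("b", "c"): 0.25, ("b", "d"): 0.1, ("c", "d"): 0.6}
names = ["a", "b", "c", "d"]
corr = {x: {y: 1.0 for y in names} for x in names}
for (x, y), r in pairs.items():
    corr[x][y] = corr[y][x] = r

for cutoff in (0.2, 0.5, 0.8):
    mean_degree, pct_isolates = degree_summary(corr, cutoff)
    print(cutoff, mean_degree, pct_isolates)
```

In this toy matrix, raising the cutoff lowers the mean degree and eventually leaves some nodes with no edges at all.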
Hubs and Key Players
Hubs – Nodes with many connections (edges).
Key Players – A set of N nodes that, in this case,
is maximally correlated with the rest of the
network.
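One way to make the key-player definition concrete is a greedy search. This is a sketch under stated assumptions, not the algorithm used by the KeyPlayer software: the objective (sum, over nodes outside the set, of each node's strongest absolute correlation with a set member) is one possible reading of "maximally correlated", and the toy matrix is made up.

```python
def greedy_key_players(corr, n):
    """Greedily pick n nodes to maximise the total, over all other nodes,
    of each node's strongest absolute correlation with a chosen node."""
    names = sorted(corr)
    chosen = []

    def score(group):
        rest = [v for v in names if v not in group]
        return sum(max(corr[v][g] for g in group) for v in rest)

    for _ in range(min(n, len(names))):
        candidates = [v for v in names if v not in chosen]
        best = max(candidates, key=lambda v: score(chosen + [v]))
        chosen.append(best)
    return chosen

# Hypothetical absolute correlations among four made-up variables.
pairs = {("a", "b"): 0.9, ("a", "c"): 0.55, ("a", "d"): 0.3,
         ("b", "c"): 0.25, ("b", "d"): 0.1, ("c", "d"): 0.2}
names = ["a", "b", "c", "d"]
corr = {x: {y: 1.0 for y in names} for x in names}
for (x, y), r in pairs.items():
    corr[x][y] = corr[y][x] = r

players = greedy_key_players(corr, 2)
print(players)
```

In this toy matrix the second node chosen is "d", the most weakly connected variable: adding it leaves the strongly correlated nodes outside the set, where they count toward the score. A greedy search is not guaranteed to find the best set of N nodes.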
10 Key Players
For the entire weighted network:
─ Age (years)
─ CD4 count (cells/mm³)
─ Urine creatinine (mg/dl)
─ CD8 count (cells/mm³)
─ Upper arm length (cm)
─ Alcohol fasting time (min)
─ Antacid / laxative fasting time (min)
─ Number of years taking insulin
─ How often wore hearing aid in the past year (number)
─ Lipid adjusted dioxin (pg/g)
Hubs and Key Players - Creatinine
Nodes with higher degrees
are larger.
The purple squares are
the 10 key players.
Notice that the key players
are not necessarily the
largest hubs.
Urine Creatinine Ego Network
[Figure: ego network of Urine Creatinine, with neighbor groups labeled Urine Elements (e.g. Molybdenum), Urine Phthalates, and Urine Phosphates]
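An ego network consists of one focal node (the ego), its direct neighbors, and the edges among them. The sketch below extracts one from an edge list; the edges are invented for illustration and are not findings from the NHANES network.

```python
def ego_network(edges, ego):
    """Return the ego plus its direct neighbours, and every edge among them."""
    neighbours = {b if a == ego else a for a, b in edges if ego in (a, b)}
    members = neighbours | {ego}
    kept = [(a, b) for a, b in edges if a in members and b in members]
    return members, kept

# Hypothetical edges; the node names are placeholders only.
edges = [
    ("Urine Creatinine", "Urine Element X"),
    ("Urine Creatinine", "Urine Phthalate Y"),
    ("Urine Element X", "Urine Phthalate Y"),
    ("Urine Phthalate Y", "Unrelated Variable"),
]
members, kept = ego_network(edges, "Urine Creatinine")
```

The last edge is dropped because "Unrelated Variable" is two steps away from the ego.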
Hubs and Key Players – CD4, CD8
Nodes with higher degrees
are larger.
The blue squares are the
10 key players.
Notice that the key players
are not necessarily the
largest hubs.
CD4, CD8, and Immunotoxins
[Figure: network around CD4 counts and CD8 counts, with node groups labeled Isoflavones, PCBs, and TCDDs]
Conclusion
Future directions:
─ Further exploration of scale-free (power law) properties
of the NHANES data network.
─ Extend methodology to binary outcomes.
─ Account for negative correlations.
─ Investigate confounding.
─ Analyze additional data sets.
Network Terms
Node – a junction point.
Edge – a line connecting two nodes.
Degree – the number of edges a node has.
Hub – a node with many connections (edges).
Key players – a group of nodes that together are
connected to the maximum number of distinct nodes.
Power-law distribution – $f(x) \sim x^{-\gamma}$
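As a rough illustration of the power-law term, the exponent γ can be estimated from a sample by maximum likelihood. The sketch below uses the standard continuous-data estimator, γ = 1 + n / Σ ln(x_i / x_min); applying it to integer node degrees is only an approximation, and the choice of `x_min` is an assumption the analyst must make. It is not a method taken from the slides.

```python
from math import log

def power_law_exponent(values, x_min):
    """Continuous maximum-likelihood estimate of gamma for the tail x >= x_min."""
    tail = [x for x in values if x >= x_min]
    log_sum = sum(log(x / x_min) for x in tail)
    if log_sum == 0:
        raise ValueError("need at least one value strictly above x_min")
    return 1 + len(tail) / log_sum

# Hypothetical sample, chosen so the arithmetic is easy to follow:
# ln(1) = 0 and ln(e) = 1, so gamma = 1 + 3 / 2.
sample = [1.0, 2.718281828459045, 2.718281828459045]
gamma = power_law_exponent(sample, x_min=1.0)
```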
A Basic Undirected Network
Isolate – a node that is not connected
to the rest of the network.
Pendant – a node that is
connected to the rest of
the network by only
one edge.
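Both terms reduce to a degree count: an isolate has degree 0 and a pendant has degree 1. A minimal sketch on a made-up undirected network:

```python
def isolates_and_pendants(nodes, edges):
    """Classify nodes of an undirected network by degree 0 and degree 1."""
    degree = {v: 0 for v in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    isolates = sorted(v for v, d in degree.items() if d == 0)
    pendants = sorted(v for v, d in degree.items() if d == 1)
    return isolates, pendants

# Hypothetical network: a triangle A-B-C, D hanging off C, and E alone.
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
isolates, pendants = isolates_and_pendants(nodes, edges)
```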