Network Mapping of Large Data Sets

Al Ozonoff, Ph.D.
Joel Bernanke, M.Sc.
Boston University School of Public Health

Network Analyses of Linked Data Sets
─ Yook et al. (2002) developed network generators that captured the Internet’s topology; they postulated preferential attachment and linear distance dependence.
  Yook, S.-H., Jeong, H., & Barabasi, A.-L. 2002. Modeling the Internet’s large-scale topology. PNAS, 99, 13382-13386.
─ Schwikowski et al. (2000) built a protein-protein interaction network in yeast to predict protein function.
  Schwikowski, B., Uetz, P., & Fields, S. 2000. A network of protein-protein interactions in yeast. Nature Biotechnology, 18(12), 1257-1261.

May 22, 2008 Interface - RISK : Reality

Networks in Public Health
─ Jones and Handcock (2003) reported on power-law scaling in sexual contact networks, relating the scaling coefficient to the rate of disease transmission and the threat of an epidemic.
  Jones, J. H., & Handcock, M. S. 2003. An assessment of preferential attachment as a mechanism for human sexual network formation. Proc. R. Soc. Lond. B, 270, 1123-1128.
─ De et al. (2004) used network centrality measures to identify key individuals in a gonorrhea outbreak.
  De, P., Singh, A. E., Wong, T., Yacoub, W., & Jolly, A. M. 2004. Sexual network analysis of gonorrhea outbreak. Sex Transm Infect, 80, 280-285.

Natural Mapping of a Data Set
When linkages are not predefined, suitable criteria for identifying linkages must be developed.
We propose a natural mapping of a data set onto a network: variables map to nodes, and the associations among variables map to edges.

The NHANES Data Set
The National Health and Nutrition Examination Survey (NHANES) assesses the health and nutritional status of adults and children in the United States through interviews and physical examinations.
The NHANES data set includes:
─ Demographics
─ Laboratory test results
─ Dietary records
─ Physiological measurements
─ General health information

Selecting Data to Map
A selected subset of continuous measures from all four of the NHANES modules was included in the analysis. Continuous measures with fewer than 20 observations were excluded.
Examples:
─ Age (years)
─ Blood titers
─ Number of green vegetables eaten per month
─ Cardiovascular stress test measurements

Generating a Correlation Matrix
We generated a correlation matrix containing the Spearman correlation between every pair of variables. All correlations were converted to their absolute values. We included correlations in the matrix regardless of their significance.

Mapping the NHANES Data Set
Variables were mapped to nodes. Spearman correlations among the variables were mapped to edges. The exact correlation was either retained as a measure of the strength of an association or dichotomized (0, 1) based on a cutoff.
[Figure: example nodes Age (years) and Body Mass Index; the edge is labeled 0.6 in the weighted version, and the dichotomized version is shown at Cutoff = 0.7.]

Software
─ SAS 9.1 – Integrate NHANES data modules and generate the correlation matrix.
─ UCINET – Convert correlation data to network data.
─ NetDraw – Visualize and analyze network data.
─ KeyPlayer – Identify key players.

Networks by Cutoff
[Figure panels: Cutoff = 0.2; Cutoff = 0.5; Cutoff = 0.8]

Distribution of Connections by Cutoff
[Figure panels: Cutoff = 0.2; Cutoff = 0.5; Cutoff = 0.8]

Degrees and Unlinked Nodes
[Figure panels: mean number of connections per node (degree) vs. cutoff; percentage of unlinked nodes (isolates) vs. cutoff]

Hubs and Key Players
Hubs – Nodes with many connections (edges).
Key Players – A set of N nodes that, in this case, is maximally correlated with the rest of the network.

10 Key Players
For the entire weighted network:
─ Age (years)
─ CD4 count (cells/mm³)
─ Urine creatinine (mg/dl)
─ CD8 count (cells/mm³)
─ Upper arm length (cm)
─ Alcohol fasting time (min)
─ Antacid / laxative fasting time (min)
─ Number of years taking insulin
─ How often wore hearing aid in the past year (number)
─ Lipid adjusted dioxin (pg/g)

Hubs and Key Players – Creatinine
Nodes with higher degrees are larger. The purple squares are the 10 key players. Notice that the key players are not necessarily the largest hubs.

Urine Creatinine Ego Network
[Figure labels: Urine Creatinine; Urine Elements (e.g. Molybdenum); Urine Phthalates; Urine Phosphates]

Hubs and Key Players – CD4, CD8
Nodes with higher degrees are larger. The blue squares are the 10 key players. Notice that the key players are not necessarily the largest hubs.

CD4, CD8, and Immunotoxins
[Figure labels: Isoflavones; CD-4 counts; CD-8 counts; PCBs; TCDDs]

Conclusion
Future directions:
─ Further exploration of scale-free (power law) properties of the NHANES data network.
─ Extend methodology to binary outcomes.
─ Account for negative correlations.
─ Investigate confounding.
─ Analyze additional data sets.

Network Terms
Node – a junction point.
Edge – a line connecting two nodes.
Degree – the number of edges a node has.
Hub – a node with many connections (edges).
Key players – a group of nodes that together are connected to the maximum number of distinct nodes.
Power-law distribution – $f(x) \sim x^{-\gamma}$

A Basic Undirected Network
Isolate – a node that is not connected to the rest of the network.
Pendant – a node that is connected to the rest of the network by only one edge.
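
Sketch: Mapping Variables to a Network
The mapping described in these slides (Spearman correlation between every pair of variables, absolute values, a cutoff to dichotomize edges, then degrees and isolates) can be illustrated with a minimal sketch. It is not the SAS / UCINET / NetDraw workflow used in the presentation: it is plain Python on synthetic data, and the function names, the variable names, and the choice of `>=` for the cutoff comparison are all assumptions made for illustration.

```python
import random
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def build_network(data, cutoff):
    """Variables become nodes; a pair gets an edge when |rho| >= cutoff.

    The >= comparison is an assumption; the slides do not state it.
    Returns {(node_a, node_b): |rho|} so the weight is retained as well.
    """
    edges = {}
    for a, b in combinations(sorted(data), 2):
        rho = abs(spearman(data[a], data[b]))
        if rho >= cutoff:
            edges[(a, b)] = rho
    return edges


def degrees(data, edges):
    """Count the edges attached to each node (isolates have degree 0)."""
    deg = {name: 0 for name in data}
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


# Synthetic stand-in data (hypothetical variables, not NHANES measures).
random.seed(0)
x = [random.uniform(20, 80) for _ in range(200)]
data = {
    "x": x,
    "x_squared": [v ** 2 for v in x],  # monotone in x
    "neg_x": [-v for v in x],          # negative correlation, kept via abs()
    "noise": [random.gauss(0, 1) for _ in range(200)],
}

edges = build_network(data, cutoff=0.7)
deg = degrees(data, edges)
isolates = [name for name, d in deg.items() if d == 0]
mean_degree = sum(deg.values()) / len(deg)

print("edges:", sorted(edges))
print("degrees:", deg)
print("isolates:", isolates)
print("mean degree:", mean_degree)
```

Rerunning `build_network` over a range of cutoffs and recording `mean_degree` and the share of isolates gives the kind of summary shown on the Degrees and Unlinked Nodes slide. Because absolute values are taken, the negatively correlated pair is linked just like the positive one, which is the limitation listed under future directions.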