BIOINFORMATICS APPLICATIONS NOTE Vol. 26 no. 3 2010, pages 432–433 doi:10.1093/bioinformatics/btp667 Systems biology DataPflex: a MATLAB-based tool for the manipulation and visualization of multidimensional datasets Bart S. Hendriks∗,† and Christopher W. Espelin Pfizer Research Technology Center, 620 Memorial Drive, Cambridge, MA 02139, USA Received on October 13, 2009; revised on November 14, 2009; accepted on November 27, 2009 Advance Access publication December 4, 2009 Associate Editor: Trey Ideker ABSTRACT Motivation: DataPflex is a MATLAB-based application that facilitates the manipulation and visualization of multidimensional datasets. The strength of DataPflex lies in the intuitive graphical user interface for the efficient incorporation, manipulation and visualization of highdimensional data that can be generated by multiplexed protein measurement platforms including, but not limited to Luminex or Meso-Scale Discovery. Such data can generally be represented in the form of multidimensional datasets [for example (time × stimulation × inhibitor × inhibitor concentration × cell type × measurement)]. For cases where measurements are made in a combinational fashion across multiple dimensions, there is a need for a tool to efficiently manipulate and reorganize such data for visualization. DataPflex accepts data consisting of up to five arbitrary dimensions in addition to a measurement dimension. Data are imported from a simple .xls format and can be exported to MATLAB or .xls. Data dimensions can be reordered, subdivided, merged, normalized and visualized in the form of collections of line graphs, bar graphs, surface plots, heatmaps, IC50’s and other custom plots. Open source implementation in MATLAB enables easy extension for custom plotting routines and integration with more sophisticated analysis tools. Availability: DataPflex is distributed under the GPL license (http://www.gnu.org/licenses/) together with documentation, source code and sample data files at: http://code.google.com/p/datapflex. Contact: DataPfl[email protected]; bhendriks@merrimackpharma. com Supplementary information: Supplementary data available at Bioinformatics online. 1 INTRODUCTION The scale of protein-based measurements is rapidly increasing with the advent of so-called ‘multiplexing’ technologies for antibodybased measurement offered by platforms such as Luminex and Meso-Scale Discovery. These technologies are readily amenable to intricate experimental layouts which simultaneously investigate multiple experimental conditions, treatments and time points simultaneously. Examples of such complex and large-scale proteinbased datasets are starting to appear in the literature (Cosgrove et al., ∗ To whom correspondence should be addressed. address: Merrimack Pharmaceuticals, 1 Kendall Square, Cambridge, MA 02139, USA. † Present 432 2009a,b; C.W.Espelin et al., manuscript in preparation; Gaudet et al., 2005; Mitsos et al., 2009; Pritchard et al., 2009; Samaga et al., 2009). In Espelin et al., for example, 73 protein measurements distributed across 12 time points, 15 stimulation conditions, 11 inhibitors at five concentrations in the U937 cell line create a rich multidimensional hypercube of data. For such datasets (in excess of three dimensions), common plotting tools are woefully inadequate for rapid reordering, manipulation and visualization of data. Cell-based experiments can be described in terms of their underlying dimensions. For example: (cell type × stimulus × compound × compound concentration × time × measurement) or (stimulus × stimulus concentration × cell type × medium conditions × compound × measurement). Plotting this kind of data using a package such as Excel is extremely time consuming, requiring the generation of plots one at a time and does not enable the rapid reorganization of data to generate plots as a function of time or concentration, for example. When plotting datasets of this nature, it is desirable to (i) be able to view the hypercube of data from different angles (rapidly reorder the dimensions), (ii) enable basic dataset manipulation, (iii) offer multiple means of data normalization, (iv) easily view subportions of the entire dataset, (v) offer multiple types of plots; and (vi) enable future development and integration with other analysis tools. DataPflex was developed to address these needs, with careful recognition that an efficient and intuitive interface would be critical for acceptance and usage by the bench biologists who generate such datasets. DataPflex fills an important niche between other software platforms, such as Spotfire and DataRail (Saez-Rodriguez et al., 2008), with its combined capabilities for handling multidimensional data, creating multiple plots simultaneously and ease of use. 2 METHODS AND IMPLEMENTATION DataPflex was written entirely in MATLAB R2007b (The Mathworks, Natick, MA, USA; the graphical user interface has known incompatibility with prior MATLAB releases). All experimental data must be categorized into six dimensions: (five arbitrary dimensions and a measurement dimension). Each dimension can be either textual (inhibitor names, for example) or numeric (time points or concentrations, for example) in nature. For dataset in excess of six dimensions, it is recommended that such datasets be split into multiple input files. Data is imported from an Excel (.xls) file, consisting of two blocks separated by an empty column. The first block is a description block with up to five dimension headings of the form ‘dimension name’ or ‘dimension name = dimension units’. For cases where there are pairs of corresponding © The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected] DataPflex Table 1. Example Excel (.xls) data format #1 for import into DataPflex (individual replicates) Cell Type Stimuli Time=min Inhibitor Inhibitor Concentration=uM [Data Description Block] Fig. 1. DataPflex screenshot. dimensions (inhibitor and inhibitor concentration, for example), dimension headings should be [name] and [name] concentration = units, where [name] is the name of the dimension (inhibitor and inhibitor concentration = uM, for example). These column headings may be in any order and the description block may contain an arbitrary number of additional columns with empty headings. The second block is the measurement data, separated from the description block by an empty column. The data block can be in one of two formats: (Table 1) with column headings correspond to the measurements performed or (Supplementary Material) with column heading corresponding to the mean and standard deviations for each measurement. In all column headings, the unit description (=units) is optional. The first row of the file must contain the headings and subsequent rows contain corresponding experimental descriptions and data. For import format #1 (Table 1) replicates are entered as in separate rows and automatically averaged upon import with SDs calculated. A sample dataset from Espelin et al. (C.W.Espelin et al., manuscript in preparation) is provided in the Excel import format. Following import, the six dimensions are shown within the DataPflex interface and can be efficiently and easily reordered with drop-down menus and subportions selected in each window (Fig. 1). Individual labels can be renamed and reordered within a given dimension. Multiple normalization options are offered for plotting, including raw data, normalization by the max value, log normalization, normalization relative to conditions/time courses within the dataset and several others. Additionally, the data corresponding to ‘No Treatment’ can be automatically added to each plot (bottom left of GUI, Fig. 1). Allowing, for example, the comparison of a treatment condition with background. Once the dimensions and data normalization have been selected, several plot types may be generated: line graphs, scatter plots, bar graphs, 3-D surface plots, mean/stdev, IC50’s, heatmaps and arrow plots. Mean/Stdev plots display the mean and standard deviation of each selected measurement in the dataset in order to obtain a general view of which species in the dataset are showing significant variability. Additional options exist for each plot type: (i) data can be clustered by row and/or column; and (ii) additional dimensions can be ‘unfolded’ (i.e. concatenated) into either the rows or columns. For heatmaps, when the third dimension is a numeric dimension, such as time or concentration, the average Measurement #1=units Measurement #2=units … [Experimental Data Block] value is plotted and the figure can be clicked on to visualize the underlying time series data. Line arrow plots contain the same information as heatmaps, but are presented in the form of an influence diagram where the thickness and color of each arrow indicates the strength of interaction. Arrow plots can also be clustered to improve presentation. A slider is provided to filter out weak interactions and entries not connected by any arrows are removed from the plots. Circle arrow plots are an analogous plot in which entries are plotted in a circle, enabling visualization of interactions within a network (stimulating with cytokines and measuring their release, for example). For each plot type, multiple color schemes are available. Sample plots from data presented in C.W.Espelin et al. (manuscript in preparation) are shown in the DataPflex Reference Guide (Supplementary Material). IC50 plots and clustering require the Statistics Toolbox (The Mathworks) and optimal leaf ordering for clustering relies on the Bioinformatics Toolbox (The Mathworks, Natick, MA). Finally, dataset manipulation and import/export capability are also integrated: Users may (i) rename datasets, (ii) create a new dataset that is a subset of another dataset, (iii) filter or split a dataset into two based on userdefined criteria; and (iv) merge datasets. In addition to .xls files, datasets can be imported and exported from either the MATLAB workspace or .mat files. Lastly, individual datasets (with both mean and standard deviation data) can be rewritten back to the Excel import .xls format and are ready for immediate reimport into DataPflex. ACKNOWLEDGEMENTS The authors would like to thank Arthur Goldsipe, Julio SaezRodriguez, Leo Alexopoulos, Natalia Boukharov, Deb Nathan, Marya Liimatta, Lucy Sun, Fei Hua, Nidhi Jhawar, Rocco Coli and Romesh Subramanian for their helpful discussions, application testing and feature ideas. Funding: This work was funded by Pfizer, Inc. Conflict of Interest: B.S.H. and C.W.E. hold stock in Pfizer, Inc. REFERENCES Cosgrove,B.D. et al. (2009a) A multipathway phosphoproteomic signaling network model of idiosyncratic drug- and inflammatory cytokine-induced toxicity in human hepatocytes. Conf Proc IEEE Eng Med Biol Soc., 1, 5452–5455. Cosgrove,B.D. et al. (2009b) Synergistic drug-cytokine induction of hepatocellular death as an in vitro approach for the study of inflammation-associated idiosyncratic drug hepatotoxicity. Toxicol. Appl. Pharmacol., 237, 317–330. Gaudet,S. et al. (2005) A compendium of signals and responses triggered by prodeath and prosurvival cytokines. Mol. Cell Proteomics, 4, 1569–1590. Mitsos,A. et al. (2009) Identifying drug effects via pathway alterations using an integer linear programming optimization formulation on phosphoproteomic data. PLoS Comput Biol., 5, e1000591 [Epub ahead of print 4 December 2009]. Pritchard,J.R. et al. (2009) Three-kinase inhibitor combination recreates multipathway effects of a geldanamycin analogue on hepatocellular carcinoma cell death. Mol. Cancer Therap., 8, 2183–2192. Saez-Rodriguez,J. et al. (2008) Flexible informatics for linking experimental data to mathematical models via DataRail. Bioinformatics, 24, 840–847. Samaga,R. et al. (2009) The logic of EGFR/ErbB signaling: theoretical properties and analysis of high-throughput data. PLoS Comput. Biol., 5, e1000438. 433
© Copyright 2026 Paperzz