DataPflex: a MATLAB-based tool for the manipulation and

BIOINFORMATICS APPLICATIONS NOTE
Vol. 26 no. 3 2010, pages 432–433
doi:10.1093/bioinformatics/btp667
Systems biology
DataPflex: a MATLAB-based tool for the manipulation and
visualization of multidimensional datasets
Bart S. Hendriks∗,† and Christopher W. Espelin
Pfizer Research Technology Center, 620 Memorial Drive, Cambridge, MA 02139, USA
Received on October 13, 2009; revised on November 14, 2009; accepted on November 27, 2009
Advance Access publication December 4, 2009
Associate Editor: Trey Ideker
ABSTRACT
Motivation: DataPflex is a MATLAB-based application that facilitates
the manipulation and visualization of multidimensional datasets. The
strength of DataPflex lies in the intuitive graphical user interface for
the efficient incorporation, manipulation and visualization of highdimensional data that can be generated by multiplexed protein
measurement platforms including, but not limited to Luminex or
Meso-Scale Discovery. Such data can generally be represented in the
form of multidimensional datasets [for example (time × stimulation
× inhibitor × inhibitor concentration × cell type × measurement)]. For
cases where measurements are made in a combinational fashion
across multiple dimensions, there is a need for a tool to efficiently
manipulate and reorganize such data for visualization. DataPflex
accepts data consisting of up to five arbitrary dimensions in addition
to a measurement dimension. Data are imported from a simple .xls
format and can be exported to MATLAB or .xls. Data dimensions
can be reordered, subdivided, merged, normalized and visualized
in the form of collections of line graphs, bar graphs, surface
plots, heatmaps, IC50’s and other custom plots. Open source
implementation in MATLAB enables easy extension for custom
plotting routines and integration with more sophisticated analysis
tools.
Availability: DataPflex is distributed under the GPL license
(http://www.gnu.org/licenses/) together with documentation, source
code and sample data files at: http://code.google.com/p/datapflex.
Contact: DataPfl[email protected]; bhendriks@merrimackpharma.
com
Supplementary information: Supplementary data available at
Bioinformatics online.
1
INTRODUCTION
The scale of protein-based measurements is rapidly increasing with
the advent of so-called ‘multiplexing’ technologies for antibodybased measurement offered by platforms such as Luminex and
Meso-Scale Discovery. These technologies are readily amenable
to intricate experimental layouts which simultaneously investigate
multiple experimental conditions, treatments and time points
simultaneously. Examples of such complex and large-scale proteinbased datasets are starting to appear in the literature (Cosgrove et al.,
∗ To
whom correspondence should be addressed.
address: Merrimack Pharmaceuticals, 1 Kendall Square,
Cambridge, MA 02139, USA.
† Present
432
2009a,b; C.W.Espelin et al., manuscript in preparation; Gaudet et al.,
2005; Mitsos et al., 2009; Pritchard et al., 2009; Samaga et al., 2009).
In Espelin et al., for example, 73 protein measurements distributed
across 12 time points, 15 stimulation conditions, 11 inhibitors at five
concentrations in the U937 cell line create a rich multidimensional
hypercube of data. For such datasets (in excess of three dimensions),
common plotting tools are woefully inadequate for rapid reordering,
manipulation and visualization of data.
Cell-based experiments can be described in terms of their
underlying dimensions. For example: (cell type × stimulus ×
compound × compound concentration × time × measurement)
or (stimulus × stimulus concentration × cell type × medium
conditions × compound × measurement). Plotting this kind of
data using a package such as Excel is extremely time consuming,
requiring the generation of plots one at a time and does not enable
the rapid reorganization of data to generate plots as a function
of time or concentration, for example. When plotting datasets of
this nature, it is desirable to (i) be able to view the hypercube
of data from different angles (rapidly reorder the dimensions), (ii)
enable basic dataset manipulation, (iii) offer multiple means of data
normalization, (iv) easily view subportions of the entire dataset, (v)
offer multiple types of plots; and (vi) enable future development
and integration with other analysis tools. DataPflex was developed
to address these needs, with careful recognition that an efficient and
intuitive interface would be critical for acceptance and usage by the
bench biologists who generate such datasets.
DataPflex fills an important niche between other software
platforms, such as Spotfire and DataRail (Saez-Rodriguez et al.,
2008), with its combined capabilities for handling multidimensional
data, creating multiple plots simultaneously and ease of use.
2
METHODS AND IMPLEMENTATION
DataPflex was written entirely in MATLAB R2007b (The Mathworks,
Natick, MA, USA; the graphical user interface has known incompatibility
with prior MATLAB releases). All experimental data must be categorized
into six dimensions: (five arbitrary dimensions and a measurement
dimension). Each dimension can be either textual (inhibitor names, for
example) or numeric (time points or concentrations, for example) in nature.
For dataset in excess of six dimensions, it is recommended that such datasets
be split into multiple input files.
Data is imported from an Excel (.xls) file, consisting of two blocks
separated by an empty column. The first block is a description block with
up to five dimension headings of the form ‘dimension name’ or ‘dimension
name = dimension units’. For cases where there are pairs of corresponding
© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
DataPflex
Table 1. Example Excel (.xls) data format #1 for import into DataPflex (individual replicates)
Cell Type
Stimuli
Time=min
Inhibitor
Inhibitor Concentration=uM
[Data Description Block]
Fig. 1. DataPflex screenshot.
dimensions (inhibitor and inhibitor concentration, for example), dimension
headings should be [name] and [name] concentration = units, where [name]
is the name of the dimension (inhibitor and inhibitor concentration = uM, for
example). These column headings may be in any order and the description
block may contain an arbitrary number of additional columns with empty
headings. The second block is the measurement data, separated from the
description block by an empty column. The data block can be in one of two
formats: (Table 1) with column headings correspond to the measurements
performed or (Supplementary Material) with column heading corresponding
to the mean and standard deviations for each measurement. In all column
headings, the unit description (=units) is optional. The first row of the
file must contain the headings and subsequent rows contain corresponding
experimental descriptions and data. For import format #1 (Table 1) replicates
are entered as in separate rows and automatically averaged upon import with
SDs calculated. A sample dataset from Espelin et al. (C.W.Espelin et al.,
manuscript in preparation) is provided in the Excel import format.
Following import, the six dimensions are shown within the DataPflex
interface and can be efficiently and easily reordered with drop-down menus
and subportions selected in each window (Fig. 1). Individual labels can be
renamed and reordered within a given dimension. Multiple normalization
options are offered for plotting, including raw data, normalization by the max
value, log normalization, normalization relative to conditions/time courses
within the dataset and several others. Additionally, the data corresponding to
‘No Treatment’ can be automatically added to each plot (bottom left of GUI,
Fig. 1). Allowing, for example, the comparison of a treatment condition
with background. Once the dimensions and data normalization have been
selected, several plot types may be generated: line graphs, scatter plots, bar
graphs, 3-D surface plots, mean/stdev, IC50’s, heatmaps and arrow plots.
Mean/Stdev plots display the mean and standard deviation of each selected
measurement in the dataset in order to obtain a general view of which species
in the dataset are showing significant variability.
Additional options exist for each plot type: (i) data can be clustered by
row and/or column; and (ii) additional dimensions can be ‘unfolded’ (i.e.
concatenated) into either the rows or columns. For heatmaps, when the third
dimension is a numeric dimension, such as time or concentration, the average
Measurement
#1=units
Measurement
#2=units
…
[Experimental Data Block]
value is plotted and the figure can be clicked on to visualize the underlying
time series data. Line arrow plots contain the same information as heatmaps,
but are presented in the form of an influence diagram where the thickness
and color of each arrow indicates the strength of interaction. Arrow plots
can also be clustered to improve presentation. A slider is provided to filter
out weak interactions and entries not connected by any arrows are removed
from the plots. Circle arrow plots are an analogous plot in which entries are
plotted in a circle, enabling visualization of interactions within a network
(stimulating with cytokines and measuring their release, for example). For
each plot type, multiple color schemes are available. Sample plots from data
presented in C.W.Espelin et al. (manuscript in preparation) are shown in
the DataPflex Reference Guide (Supplementary Material). IC50 plots and
clustering require the Statistics Toolbox (The Mathworks) and optimal leaf
ordering for clustering relies on the Bioinformatics Toolbox (The Mathworks,
Natick, MA).
Finally, dataset manipulation and import/export capability are also
integrated: Users may (i) rename datasets, (ii) create a new dataset that is a
subset of another dataset, (iii) filter or split a dataset into two based on userdefined criteria; and (iv) merge datasets. In addition to .xls files, datasets can
be imported and exported from either the MATLAB workspace or .mat files.
Lastly, individual datasets (with both mean and standard deviation data) can
be rewritten back to the Excel import .xls format and are ready for immediate
reimport into DataPflex.
ACKNOWLEDGEMENTS
The authors would like to thank Arthur Goldsipe, Julio SaezRodriguez, Leo Alexopoulos, Natalia Boukharov, Deb Nathan,
Marya Liimatta, Lucy Sun, Fei Hua, Nidhi Jhawar, Rocco Coli
and Romesh Subramanian for their helpful discussions, application
testing and feature ideas.
Funding: This work was funded by Pfizer, Inc.
Conflict of Interest: B.S.H. and C.W.E. hold stock in Pfizer, Inc.
REFERENCES
Cosgrove,B.D. et al. (2009a) A multipathway phosphoproteomic signaling network
model of idiosyncratic drug- and inflammatory cytokine-induced toxicity in human
hepatocytes. Conf Proc IEEE Eng Med Biol Soc., 1, 5452–5455.
Cosgrove,B.D. et al. (2009b) Synergistic drug-cytokine induction of hepatocellular
death as an in vitro approach for the study of inflammation-associated idiosyncratic
drug hepatotoxicity. Toxicol. Appl. Pharmacol., 237, 317–330.
Gaudet,S. et al. (2005) A compendium of signals and responses triggered by prodeath
and prosurvival cytokines. Mol. Cell Proteomics, 4, 1569–1590.
Mitsos,A. et al. (2009) Identifying drug effects via pathway alterations using an integer
linear programming optimization formulation on phosphoproteomic data. PLoS
Comput Biol., 5, e1000591 [Epub ahead of print 4 December 2009].
Pritchard,J.R. et al. (2009) Three-kinase inhibitor combination recreates multipathway
effects of a geldanamycin analogue on hepatocellular carcinoma cell death. Mol.
Cancer Therap., 8, 2183–2192.
Saez-Rodriguez,J. et al. (2008) Flexible informatics for linking experimental data to
mathematical models via DataRail. Bioinformatics, 24, 840–847.
Samaga,R. et al. (2009) The logic of EGFR/ErbB signaling: theoretical properties and
analysis of high-throughput data. PLoS Comput. Biol., 5, e1000438.
433