Making Sense out of Flow Cytometry Data Overload

Introduction to R
Second Annual Cytomics Workshop
April, 2017
© 2015 by Wade Rogers
Outline
• Background
 R
 Bioconductor
• Motivating examples
• Starting R, entering commands
• How to get help
• R fundamentals
 Sequences and Repeats
 Characters and Numbers
 Vectors and Matrices
 Data Frames and Lists
 Importing data from spreadsheets
R
• R
 Is an integrated suite of software facilities for data manipulation,
simulation, calculation and graphical display.
 It handles and analyzes data very effectively and it contains a suite of
operators for calculations on arrays and matrices.
 In addition, it has the graphical capabilities for very sophisticated graphs
and data displays.
 It is an elegant, object-oriented programming language.
 Started by Robert Gentleman and Ross Ihaka (hence “R”) in 1995
 as a free, independent, open-source implementation of the S
programming language (now part of Spotfire)
 Currently maintained by the R Core development team – an
international group of hard-working volunteer developers
http://www.r-project.org
http://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf
Bioconductor
• Bioconductor
 “Is an open source and open development software project to provide
tools for the analysis and comprehension of genomic data.”
 Goals
 To provide widespread access to a broad range of powerful
statistical and graphical methods for the analysis of genomic data.
 To provide a common software platform that enables the rapid
development and deployment of extensible, scalable, and
interoperable software.
 To further scientific understanding by producing high-quality
documentation and reproducible research.
 To train researchers on computational and statistical methods for the
analysis of genomic data.
http://bioconductor.org/overview
Flow Cytometry in Bioconductor
• About 40 packages specific to flow cytometry
available in Bioconductor
• What’s so different about flow cytometry
anyway?
A motivating example
I’ve just collected data from a T cell stimulation experiment in a 96-well
plate format. I need to gate the data on CD3/CD4. How consistent are
the distributions, so that I can establish one set of gates for the whole
plate and be confident that the results are valid for all of the wells?
Another motivating example
I’m concerned that drawing gates to analyze my data introduces
unintended bias. Additionally, since I have multiple data files,
drawing multiple gates is time consuming. Can I use R to compute
gates and then apply these same objective gating criteria to multiple
data files?
A third example
I often drain my tubes since I’m trying to acquire as many events as I
can from a limited sample for a rare event assay. I’m concerned that
the disruption of flow near the beginning and end of the acquisition
(and sometimes in the middle due to minor clogs) may introduce an
“artificial phenotype”. Is there some way to automatically detect and
edit out portions of a file that aren’t consistent with the rest?
Back to the basics
• R is a command-line-driven program
 the prompt is: >
 you type a command (shown in blue), and R executes the command
and gives the answer (shown in black)
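As a minimal sketch of this prompt-and-answer cycle (the particular commands are illustrative and not taken from the workshop), each line below can be typed at the > prompt; the comments show what R prints back.

```r
# Type each line at the > prompt and press Enter.
2 + 3        # R answers: [1] 5
sqrt(16)     # R answers: [1] 4
```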
Simple example: enter a set of measurements
• Use the function c() to combine terms together
• Create a variable named mfi
• Put the result of c() into mfi using the
assignment operator <- (you can also use =)
• The [1] in the output is the index of the first element shown on
that line; R prints every vector this way
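The steps above can be sketched as follows. The measurement values are made up for illustration; only c(), the name mfi, and the <- operator come from the slide.

```r
# Combine five measurements into one vector and store it in mfi.
# The numbers are made-up values for illustration.
mfi <- c(1520, 1480, 1610, 1395, 1550)

# Typing the variable name prints its contents.
mfi          # [1] 1520 1480 1610 1395 1550

# Functions operate on the whole vector at once.
length(mfi)  # [1] 5
mean(mfi)    # [1] 1511
```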
RStudio
• Console
• Editor
• Env, History
• Your best friend
RStudio lower right pane
RStudio help
Package Vignette – really good help!
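Help is also available from the console. The sketch below uses only functions from the standard R distribution; the choice of mean and of the grid package is an assumption made for illustration, and any function or installed package can be substituted.

```r
# Open the help page for a function you know by name (two equivalent forms).
help("mean")
?mean

# Find functions whose names match a pattern.
apropos("read.c")     # the result includes "read.csv"

# Full-text search of the installed documentation (can be slow the first time).
# help.search("standard deviation")

# List the vignettes that ship with a package; grid is part of the
# standard R distribution and is used here only as an example.
v <- vignette(package = "grid")
v$results[, "Item"]
```

In an interactive session, browseVignettes() opens the same list in a web browser.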
BASIC DATA STRUCTURES
Sequences and Repeats
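A minimal sketch of building sequences and repeats with the colon operator, seq(), and rep(). The specific values are illustrative assumptions.

```r
# The colon operator builds a run of consecutive integers.
1:5                               # [1] 1 2 3 4 5

# seq() gives control over the step size or the number of points.
seq(0, 1, by = 0.25)              # [1] 0.00 0.25 0.50 0.75 1.00
seq(10, 100, length.out = 4)      # [1]  10  40  70 100

# rep() repeats a whole vector (times) or each element in turn (each).
rep(c("a", "b"), times = 2)       # [1] "a" "b" "a" "b"
rep(c("a", "b"), each = 2)        # [1] "a" "a" "b" "b"
```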
Characters and Numbers
• Characters and character strings are enclosed in straight double quotes ("") or single quotes ('')
• Special numbers
 NA – “Not Available”
 Inf – “Infinity”
 NaN – “Not a Number”
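A short sketch of how character strings and the special values behave in practice. The variable names and numbers are illustrative assumptions.

```r
# Character strings use straight double or single quotes.
marker <- "CD4"
class(marker)          # [1] "character"
nchar(marker)          # [1] 3

# NA marks a missing value; most arithmetic on it yields NA.
x <- c(10, NA, 30)
is.na(x)               # [1] FALSE  TRUE FALSE
mean(x)                # [1] NA
mean(x, na.rm = TRUE)  # [1] 20

# Inf and NaN arise from arithmetic.
1 / 0                  # [1] Inf
0 / 0                  # [1] NaN
is.nan(0 / 0)          # [1] TRUE
```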
Factors
• Factors capture categorical data (variables that take on discrete,
often descriptive, values)
• We’ll see more about factors when we talk about data frames …
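A minimal sketch of a factor, assuming hypothetical treatment labels for six wells:

```r
# Hypothetical treatment labels for six wells.
treatment <- factor(c("stim", "unstim", "stim", "unstim", "stim", "stim"))

# The distinct categories, sorted alphabetically by default.
levels(treatment)      # [1] "stim"   "unstim"

# Count how many observations fall in each category.
table(treatment)       # stim: 4, unstim: 2

# Internally, each value is stored as an integer code into the levels.
as.integer(treatment)  # [1] 1 2 1 2 1 1
```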
Vectors and Matrices
• The subset operator for vectors and matrices is [ ]
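The subset operator can be sketched as follows, with illustrative values. A vector takes one index; a matrix takes a row index and a column index.

```r
v <- c(10, 20, 30, 40)
v[2]            # [1] 20
v[c(1, 3)]      # [1] 10 30
v[-1]           # [1] 20 30 40   (a negative index drops an element)
v[v > 25]       # [1] 30 40      (a logical index keeps matching elements)

# A matrix takes [row, column]; leaving one blank selects all of it.
m <- matrix(1:6, nrow = 2)   # filled column by column
m[1, 2]         # [1] 3
m[2, ]          # [1] 2 4 6
m[, 3]          # [1] 5 6
```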
Vectors and Matrices
• You can extend the length of a vector via subsetting
… but not a matrix
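A sketch of the difference, with illustrative values. The tryCatch() wrapper is only there so the example runs to completion; at the prompt, R would simply print the error.

```r
v <- c(1, 2, 3)
v[5] <- 10      # assigning past the end grows the vector
v               # [1]  1  2  3 NA 10

m <- matrix(1:4, nrow = 2)
# The same trick fails for a matrix: row 3 does not exist.
result <- tryCatch({ m[3, 1] <- 0; "extended" },
                   error = function(e) conditionMessage(e))
result          # an error message: subscript out of bounds
dim(m)          # [1] 2 2   (the matrix is unchanged)
```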
Vectors and Matrices
• However, all's not lost if you want to extend either the columns …
… or rows
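One way to do this is with cbind() and rbind(), sketched below with illustrative values. Both return a new matrix rather than modifying the original.

```r
m <- matrix(1:4, nrow = 2)

# cbind() returns a new matrix with an extra column ...
wider <- cbind(m, c(5, 6))
dim(wider)      # [1] 2 3

# ... and rbind() returns a new matrix with an extra row.
taller <- rbind(m, c(7, 8))
dim(taller)     # [1] 3 2
taller[3, ]     # [1] 7 8
```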
Data Frames
• A Data Frame is like a matrix, except that the data type in each
column need not be the same (data polymorphism)
 Often, a Data Frame is created from an Excel spreadsheet: use Save As…
to write a tab-delimited text file (or a CSV file), then read it with the
function read.table() or read.csv()
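A data frame can also be built by hand, which makes the mixed column types easy to see. This is a minimal sketch; the column names and values are hypothetical.

```r
# Hypothetical per-well summary: text, numeric, and logical columns together.
wells <- data.frame(
  well    = c("A1", "A2", "A3"),
  cd4_pct = c(42.1, 39.8, 44.5),
  stim    = c(TRUE, FALSE, TRUE)
)

str(wells)              # one line per column, showing each column's type
wells$cd4_pct           # [1] 42.1 39.8 44.5
wells[wells$stim, ]     # only the rows where stim is TRUE
nrow(wells)             # [1] 3
```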
Data Frames from spreadsheets
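A sketch of reading a spreadsheet export into a data frame. To keep the example self-contained, it first writes small stand-in files to a temporary location; the file contents and column names are hypothetical, and in practice the path would point at the file saved from the spreadsheet.

```r
# Stand-in for a spreadsheet saved as tab-delimited text.
# File contents and column names are hypothetical.
txt_file <- tempfile(fileext = ".txt")
writeLines(c("well\tcd4_pct\tstim",
             "A1\t42.1\tyes",
             "A2\t39.8\tno"), txt_file)

# read.table() needs to be told about the header row and the tab separator.
plate <- read.table(txt_file, header = TRUE, sep = "\t")
plate$cd4_pct           # [1] 42.1 39.8

# read.csv() assumes a header row and comma separators.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("well,cd4_pct", "A1,42.1", "A2,39.8"), csv_file)
plate_csv <- read.csv(csv_file)
names(plate_csv)        # [1] "well"    "cd4_pct"
```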
Lists
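A list holds elements of different types and lengths under one name. The sketch below assumes a hypothetical record for a single sample.

```r
# Hypothetical record for one sample: each element has its own type and length.
sample_info <- list(
  id      = "donor_01",
  markers = c("CD3", "CD4", "CD8"),
  events  = 50000
)

sample_info$id              # [1] "donor_01"
sample_info[["markers"]]    # [1] "CD3" "CD4" "CD8"
length(sample_info)         # [1] 3

# Single brackets return a (shorter) list rather than the element itself.
class(sample_info["events"])    # [1] "list"
class(sample_info[["events"]])  # [1] "numeric"
```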