Visualizing Multivariate Data – R Session 3

Visualizing multivariate data using R [1]
Xijin Ge
SDSU Math/Stat
1. mtcars
# mtcars is a dataset of cars
               mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710    22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
> ?mtcars
# what does “am” stand for?
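Before plotting, it can help to look at the data from the console. The lines below are a minimal sketch that uses only base R functions on the built-in mtcars data; they are not part of the original session.

```r
# First three rows, matching the excerpt shown above
head(mtcars, 3)

# Column types and dimensions of the data frame
str(mtcars)

# "am" is coded as 0/1; the help page opened by ?mtcars explains the coding
table(mtcars$am)
```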
2. A bubble plot extends the scatter plot: it uses one additional variable to determine the size of the
symbols. An interesting video that uses bubble plots: http://youtu.be/jbkSRLYSojo
> attach(mtcars) # attach the Motor Trend dataset
> plot(wt, mpg, pch=16, cex=hp/50, col=rainbow(8)[cyl]) # wt and mpg are x, y coordinates; hp sets the size, cyl the color.
> points(wt, mpg, cex=hp/50) # draw an outline around each bubble
Fig. 1 represents information on four variables, i.e., four dimensions. We can see that these
four variables (wt, mpg, hp and cyl) are all correlated: heavier cars tend to have more cylinders, more
horsepower and lower mpg. I am curious about the lightest car that has relatively high horsepower.
Figure 1. Bubble plot of the mtcars dataset (x axis: wt; y axis: mpg). Size of the bubble corresponds to hp (horsepower)
and color denotes cyl (number of cylinders). Green, blue, and red denote 4, 6 and 8 cylinders, respectively.
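The correlation seen in the bubble plot can also be checked numerically, and the question about the lightest car with relatively high horsepower can be explored with a few lines of base R. This is a sketch only: treating "relatively high" as above the 75th percentile of hp is an assumption made here for illustration, not a definition from the handout.

```r
# Pairwise correlations among the four variables shown in the bubble plot
vars <- mtcars[, c("wt", "mpg", "hp", "cyl")]
cor_mat <- cor(vars)
round(cor_mat, 2)

# Assumption: "relatively high" horsepower means above the 75th percentile
hp_cut <- quantile(mtcars$hp, 0.75)
strong <- mtcars[mtcars$hp > hp_cut, c("wt", "hp", "cyl", "mpg")]

# Sort the high-horsepower cars by weight, lightest first
strong_by_wt <- strong[order(strong$wt), ]
strong_by_wt
```

Changing the cutoff (for example to the median) changes which cars qualify, so the answer depends on how "relatively high" is defined.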
[1] This workshop is supported by National Science Foundation RII Track I. This material is based on
http://tryr.codeschool.com/
3. A mosaic plot displays the association between two or more categorical variables.
> heartatk = read.table("http://statland.org/R/RC/heartatk4R.txt", header=TRUE) # load in data
> fix(heartatk) # make sure CHARGES is numeric; you could also use the R menu Edit → Data Editor.
> attach(heartatk) # attach the heartatk dataset.
> a = table(SEX, DIED) # tabulate SEX and DIED and store the counts in a 2-D array.
> mosaicplot(a, color = T) # generate the mosaic plot
Figure 2. Mosaic plot by two factors (SEX, DIED) in the heartatk dataset (horizontal axis: SEX, with levels F and M;
vertical axis: DIED, with levels 0 and 1).
The area of each of the four blocks in the figure represents the count for the corresponding combination. The
blocks are also color-coded by combination. Horizontally, the blocks are divided by SEX; we can observe
that there are more men than women in this dataset. Vertically, the blocks are divided by DIED (1 for died
in hospital). We can conclude that, regardless of gender, only a small proportion of patients died in
hospital. Comparing men and women, we also see that the percentage of women who died in hospital is
noticeably higher than that of men. This is a rather unusual observation.
More complicated mosaic plots:
> mosaicplot(~ Sex + Age + Survived, data = Titanic, color = T) # mosaic plot by Sex, Age, Survived.
> mosaicplot(Titanic, color = T) # mosaic plot of the whole Titanic dataset
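Percentages read off a mosaic plot by eye can be confirmed with row proportions. The sketch below uses the built-in Titanic table so that it runs without downloading anything; the variable names sex_surv and row_prop are introduced here for illustration only.

```r
# Collapse the 4-way Titanic table to Sex x Survived counts
sex_surv <- margin.table(Titanic, c(2, 4))
sex_surv

# Row proportions: within each sex, the fraction that did / did not survive
row_prop <- prop.table(sex_surv, 1)
round(row_prop, 3)

# The analogous check for the heartatk table built earlier in this section
# (assuming heartatk has been loaded and attached) would be:
# prop.table(table(SEX, DIED), 1)
```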
4. Graphical display of multivariate data using the “lattice” package
> library(lattice) # If lattice is not installed, install it first: R main menu Packages → Install package(s)
> histogram(~AGE | factor(SEX)) # generate a panel of histograms of AGE, one per level of SEX.
> histogram(~AGE | factor(SEX), layout=c(1,2)) # result is shown in Fig. 3. The panel layout is one column, two rows.
Figure 3. A panel of histograms of AGE by SEX (panels: M and F; y axis: Percent of Total) in the heartatk dataset,
generated by the lattice package.
> densityplot(~AGE,groups=SEX,plot.points=F, auto.key=list(columns=2))
Figure 4. Overlaid density plots of AGE for the two SEX groups, F and M (y axis: Density).
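The histogram and density calls above rely on attach(). As an alternative sketch, lattice functions also accept a data= argument, which avoids attaching the data frame. The example runs on the built-in mtcars data so it works offline; the object names p_hist and p_dens are chosen here for illustration.

```r
library(lattice)

# Same idea as histogram(~AGE | factor(SEX)), but on mtcars and with data=
# instead of attach(): one panel per cylinder count, stacked in one column
p_hist <- histogram(~ mpg | factor(cyl), data = mtcars, layout = c(1, 3))
print(p_hist)

# Overlaid densities, one curve per group, as in the densityplot() call above
p_dens <- densityplot(~ mpg, groups = cyl, data = mtcars,
                      plot.points = FALSE, auto.key = list(columns = 3))
print(p_dens)

# For the heartatk data (assuming it was loaded as in section 3):
# histogram(~ AGE | factor(SEX), data = heartatk, layout = c(1, 2))
```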
> bwplot(AGE ~ as.character(DRG) | SEX) # Box-and-whisker plot in Fig. 5
As you can see, this plot supports our previous observation that people who died in hospital are older than
survivors, and shows that patients who developed complications are older than those who did not. Question:
Did people with complications stay longer in the hospital? (DRG codes: 121, with complications; 122, no
complications; 123, died.)
Question: Can you produce a figure that compares the age distribution of patients who died in the
hospital to those who survived? Break down by gender.
Figure 5. Multiple boxplots generated by the lattice package.
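One possible starting point for the question above (age distribution of patients who died versus those who survived, broken down by gender) is a conditioned box plot. This is a sketch, not the only answer: the helper name age_by_outcome is hypothetical, and it assumes a data frame with columns AGE, DIED and SEX, as in heartatk.

```r
library(lattice)

# Hypothetical helper: box plots of AGE for survivors (DIED = 0) and
# non-survivors (DIED = 1), with one panel per level of SEX.
# Assumes the data frame d has columns AGE, DIED and SEX.
age_by_outcome <- function(d) {
  bwplot(AGE ~ factor(DIED) | factor(SEX), data = d,
         xlab = "DIED (0 = survived, 1 = died in hospital)",
         ylab = "AGE")
}

# Usage, once heartatk has been loaded with read.table() as in section 3:
# print(age_by_outcome(heartatk))
```

A histogram or density plot with the same conditioning formula would serve equally well; the box plot is chosen here only because it matches the bwplot() call used above.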
Reading files into R:
1. Download and unzip (7-zip, WinRAR, tar and gzip, WinZip…)
2. Text file (CSV, txt, …) or binary file (XLS, XLSX, …)?
3. Open with a text editor (TextPad or Notepad++) to have a look.
a. Rows and columns? Row IDs and column names? header=
b. Delimiters? (space, comma, tab…) sep=
c. Missing values? NA, na, NULL, blank, NaN, 0 na.strings=
4. Optional: Open as a text file with Excel and choose the appropriate delimiter while importing, or use
Text to Columns under Data. Beware of the annoying automatic conversion in Excel: “OCT4”
→ “4-OCT”. Edit column names, then save as a CSV or tab-delimited text file for reading into R.
5. Change the working directory to where the file was saved. Main menu: File → Change dir…
6. scan(), read.table(), or read.csv()
x = read.table("somefile", sep="\t", header=TRUE, na.strings="NA")
7. Double-check the data with fix(); click on a column name to make sure each column is
recognized correctly as “character” or “numeric”.
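The steps above can be sketched end to end without any external file. The example below writes a tiny tab-delimited file to a temporary location and reads it back; the values and the ID column are made up purely for illustration, and checking with str() is offered as a non-interactive alternative to fix().

```r
# Write a tiny tab-delimited file (made-up values, for illustration only)
tmp <- tempfile(fileext = ".txt")
writeLines(c("ID\tAGE\tCHARGES",
             "p1\t72\t4752",
             "p2\t64\tNA",
             "p3\t\t3941"), tmp)

# header=, sep= and na.strings= correspond to items 3a, 3b and 3c above;
# here both "NA" and an empty field are treated as missing
x <- read.table(tmp, sep = "\t", header = TRUE,
                na.strings = c("NA", ""), stringsAsFactors = FALSE)

# Check that each column was recognized as character or numeric
str(x)
sapply(x, class)

# Count missing values per column
colSums(is.na(x))
```

If a column that should be numeric shows up as character, the usual cause is an unrecognized missing-value code, which can be added to na.strings.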