Suppose we go out and collect some data. Now what?
a) first, figure out what kind of data you have:
categorical/ordinal/quantitative (discrete/continuous).
b) second, how many records (= cases) do you have?
n = sample size
c) third, look at your data (organize it!)
often, you use graphical methods for this.
Graphical methods:
a) So we sort the data. One good way is with a stem and leaf plot (see example 2.8, p. 22 [p.
18] {not in 4th edition} {not in 5th edition})
- (illustrate using heights)
- key -> what are units?
- we now have our first “look” at the overall data.
- turn stem and leaf sideways -> resembles histogram.
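A stem and leaf plot can be sketched in a few lines of Python. The heights used in lecture are not recorded in these notes, so this sketch uses the MAO values from the boxplot example later in these notes. The function name stem_and_leaf and the choice of whole units as stems and tenths as leaves are assumptions made for illustration.

```python
from collections import defaultdict

# MAO values (nmoles/10^8 platelets) from the boxplot example in these notes.
mao = [4.1, 5.2, 6.8, 7.3, 7.4, 7.8, 7.8, 8.4, 8.7,
       9.7, 9.9, 10.6, 10.7, 11.9, 12.7, 14.2, 14.5, 18.8]

def stem_and_leaf(values):
    """Return {stem: [leaves]} with stem = whole units and leaf = tenths."""
    plot = defaultdict(list)
    for v in sorted(values):
        tenths = round(v * 10)            # integer tenths avoid float round-off
        stem, leaf = divmod(tenths, 10)
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(mao)
for stem in range(min(plot), max(plot) + 1):   # print empty stems too
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:2d} | {leaves}")
print("Key: 7 | 3 means 7.3 nmoles/10^8 platelets")
```

Printing the empty stems (13, 15, 16, 17) keeps the spacing honest, so the plot turned sideways still resembles a histogram.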
b) Histograms:
- used a lot. Useful to show frequency distributions.
- Frequency distribution - in our case, shows how many of each type are in a particular
group.
- See last figure in example 2.4. This shows how many sows had a specific litter size.
- Use example with heights.
- Use 1) foot intervals, 2) 1/2 foot intervals.
- these intervals are also called class widths.
- Go through example 2.6, pp. 17 & 18 [pp. 14 & 15] {31 - 34} {31 - 34}.
- Note shape of distribution:
- mode - the peak of a distribution, i.e., the value (or class) which has the highest
frequency.
- sometimes a distribution can be bimodal.
- tails - the ends of the distribution.
- Comment: the width of each class does not necessarily need to be the same. See fig.
2.13, p. 22 [p. 18] {not in 4th edition} {not in 5th edition}.
- also see what the problems can be.
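To see how the choice of class width changes the picture, here is a minimal sketch that builds a frequency distribution twice, once with a class width and once with half that width (the same idea as the foot and 1/2 foot intervals above). It uses the MAO values from the boxplot example later in these notes. The function name frequency_table, the starting point of 4.0, and the widths 4.0 and 2.0 are assumptions chosen for illustration.

```python
from collections import Counter

# MAO values (nmoles/10^8 platelets) from the boxplot example in these notes.
mao = [4.1, 5.2, 6.8, 7.3, 7.4, 7.8, 7.8, 8.4, 8.7,
       9.7, 9.9, 10.6, 10.7, 11.9, 12.7, 14.2, 14.5, 18.8]

def frequency_table(values, width, start):
    """Count values in classes [start, start+width), [start+width, ...).

    Assumes start is at or below the smallest value and that values
    and width are given to one decimal place.
    """
    w, s = round(width * 10), round(start * 10)   # integer tenths: exact boundaries
    counts = Counter((round(v * 10) - s) // w for v in values)
    table = []
    for k in range(max(counts) + 1):
        lo = (s + k * w) / 10
        hi = (s + (k + 1) * w) / 10
        table.append((lo, hi, counts.get(k, 0)))
    return table

for width in (4.0, 2.0):
    print(f"class width = {width}")
    for lo, hi, n in frequency_table(mao, width, start=4.0):
        print(f"  {lo:4.1f} to <{hi:4.1f} | {'*' * n} ({n})")
```

With the narrower classes the table shows more detail, including an empty class, while the wider classes give a smoother but coarser shape.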
d) One of the most important functions of histograms is to give you an idea how your data is
distributed (of course, this will depend a little on what kind of histogram you make).
- overlay our histogram by a smooth curve. What does this curve look like?
- See fig. 2.2, p. 26 [p. 21] {fig. 2.2.13, p. 35} {fig. 2.2.13, p. 35} in text.
- Illustrate:
- symmetric, bell shape. Very common in biological data.
- symmetric, with long tails. This can create problems in analyses.
- skewed to right/left.
- exponential
- bimodal
- uniform
- symmetric, U shaped.
- see examples on p. 27 [p. 21] {36} {36}.
- One of the first things you should do in any statistical analysis is figure out what kind
of distribution your sample has. This can determine what kind of analysis you can do.
We will learn more sophisticated techniques for doing this later.
- We will deal with distributions again, soon.
e) Boxplots (see section 2.5 [2.5] {2.4} {2.4} in text (but note that we will always make
“modified” boxplots)).
- Boxplots are a way of summarizing the data in a very quick and understandable
fashion.
- Problem: boxplots can be done differently by different people. Your text presents one
way (probably the most common), but there are others.
- The basic idea is as follows:
1) determine the median.
2) determine Q1 and Q3.
- what are Q1 and Q3?
- the median divides the data in half.
- Q1 divides the lower half in half, Q3 divides the upper half in half.
3) multiply the IQR (interquartile range, = Q3 - Q1) by 1.5.
4) Draw a horizontal (or vertical) line (an axis) going from somewhere below the
minimum value to somewhere above the maximum value.
5) Draw a line for the median.
6) Draw a box extending from Q1 to Q3.
7) Draw a line going to the minimum value if this is GREATER than or equal to
(Q1-1.5xIQR).
- If not, draw a line to the smallest value greater than or equal to (Q1-1.5xIQR).
8) Draw a line going to the maximum value if this is LESS than or equal to
(Q3+1.5xIQR).
- If not, draw a line to the largest value less than or equal to
(Q3+1.5xIQR).
9) Draw individual dots for any data values outside this range.
- Best to look at an example:
See example 2.21 on p. 40 [2.21, p. 33] {2.4.2, p. 46} {2.4.2, p. 46} of your text.
Here is exercise 2.24 from p. 47 [2.31, p. 38] {2.4.2, p. 51} {2.4.2, p. 51}. MAO
is an enzyme that may be involved in schizophrenia, and these levels were
measured (in nmoles/10^8 platelets) in 18 patients:
1) Determine the median:
- arranging the data from smallest to largest (lower nine values on the first line,
upper nine on the second):
4.1 5.2 6.8 7.3 7.4 7.8 7.8 8.4 8.7
9.7 9.9 10.6 10.7 11.9 12.7 14.2 14.5 18.8
- the median is (8.7+9.7)/2 = 9.2
- Q1 is 7.4 (the median of the lower nine values)
- Q3 is 11.9 (the median of the upper nine values)
- the IQR (interquartile range) is Q3-Q1 = 4.5
- multiply this by 1.5: 4.5 x 1.5 = 6.75
Now arrange all this into a boxplot (we’ll go over the details in lecture):
[Boxplot of the MAO data: horizontal axis C1 with ticks at 3.0, 6.0, 9.0, 12.0, 15.0
and 18.0; one value is plotted individually as *.]
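The numbers behind this boxplot can be checked with a short Python sketch that follows steps 1) through 9) above. The function name modified_boxplot_stats and the dictionary keys are hypothetical. The sketch takes Q1 and Q3 as the medians of the lower and upper halves, and leaving the middle value out of both halves when n is odd is an assumption (other conventions exist). It also assumes at least two data values.

```python
from statistics import median

# MAO values (nmoles/10^8 platelets) for the 18 patients in the example above.
mao = [4.1, 5.2, 6.8, 7.3, 7.4, 7.8, 7.8, 8.4, 8.7,
       9.7, 9.9, 10.6, 10.7, 11.9, 12.7, 14.2, 14.5, 18.8]

def modified_boxplot_stats(values):
    """Median, quartiles, fences, whisker ends and outliers for a modified boxplot."""
    xs = sorted(values)
    half = len(xs) // 2
    q1 = median(xs[:half])        # median of the lower half
    q3 = median(xs[-half:])       # median of the upper half
    iqr = q3 - q1
    lo_fence = q1 - 1.5 * iqr
    hi_fence = q3 + 1.5 * iqr
    inside = [x for x in xs if lo_fence <= x <= hi_fence]
    return {
        "median": median(xs),
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "fences": (lo_fence, hi_fence),
        "whiskers": (min(inside), max(inside)),   # whisker ends (steps 7 and 8)
        "outliers": [x for x in xs if x < lo_fence or x > hi_fence],  # step 9
    }

stats = modified_boxplot_stats(mao)
for name, value in stats.items():
    print(f"{name:9s} {value}")
```

For these data the upper whisker stops at 14.5 rather than at the maximum, because 18.8 lies above Q3+1.5xIQR and is therefore drawn as an individual point.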
As mentioned, not all boxplots are alike:
- sometimes the mean is used
- sometimes CIs (confidence intervals) are used instead of the IQR for the length
of the box.
- sometimes the whiskers go to the extremes, even if these are outside the range
we use above.
- so, hopefully the person who made the boxplot will include a key to tell you
what all the things are.
- to make things worse, there are slightly different ways of getting Q1 and Q3.
The differences are minor, but can lead to some confusion (even Minitab does it
two different ways!).
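To get a feel for how small these differences are, here is a minimal sketch comparing three ways of getting Q1 and Q3 for the MAO data: the median-of-halves rule used in these notes, and the two interpolation methods offered by Python's statistics.quantiles ("exclusive" and "inclusive"). These are offered only as examples of differing conventions. No claim is made that either one matches what Minitab does.

```python
from statistics import median, quantiles

# MAO values (nmoles/10^8 platelets) for the 18 patients in the example above.
mao = [4.1, 5.2, 6.8, 7.3, 7.4, 7.8, 7.8, 8.4, 8.7,
       9.7, 9.9, 10.6, 10.7, 11.9, 12.7, 14.2, 14.5, 18.8]

xs = sorted(mao)
half = len(xs) // 2

# Rule used in these notes: quartiles are the medians of the two halves.
halves = (median(xs[:half]), median(xs[-half:]))

# Two interpolation rules from the standard library (Python 3.8 or later).
# quantiles(..., n=4) returns [Q1, median, Q3].
excl = quantiles(xs, n=4, method="exclusive")
incl = quantiles(xs, n=4, method="inclusive")

print(f"median of halves: Q1 = {halves[0]:.3f}, Q3 = {halves[1]:.3f}")
print(f"exclusive:        Q1 = {excl[0]:.3f}, Q3 = {excl[2]:.3f}")
print(f"inclusive:        Q1 = {incl[0]:.3f}, Q3 = {incl[2]:.3f}")
```

All three rules agree on the median. They disagree slightly on the quartiles, which is why a key or a note on the method is helpful.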
Boxplots can give you a good initial comparison between two different populations. See
p. 46 [37] {51} {not in 5th edition} in your text. For example, here’s part of exercise
7.84 [7.102] {7.S.18} {7.S.18}. In this example, researchers wanted to know if infection
with malaria hurt the ability of lizards to run. They measured how far (meters) the
lizards could run in two minutes:
[Side-by-side boxplots for groups 1 and 2: horizontal axis "distance" with ticks at
18.0, 24.0, 30.0, 36.0 and 42.0.]
(1 is infected, 2 is uninfected)
One can tell almost right away that malaria probably affects how far lizards can run.
(Note: we haven't proven (in the statistical sense) that there is an impact of malaria!)