Organization of data.
Suppose we go out and collect some data. Now what?
First, what kind of data have you got? (see pages 10 - 11, [9 - 10], {26-27}). Let’s do a simple
experiment - sex & height.
a) categorical - examples are:
i) Not ordinal: you can't sort the categories:
Sex, blood type, geographic location, color
Note that these examples are NOT ordinal (see below), meaning that you
can’t put them into any kind of logical order.
ii) ordinal: you can give your categories some kind of logical order:
Size of t-shirts: small, medium, large, x-large.
Teacher evaluations: 1,2,3,4,5
Incidentally, note that for these ratings 2 x 2 ≠ 4: the numbers are only
labels for ordered categories, so doing arithmetic on them doesn't even make sense!
tenderness of beef: tough, slightly tough, slightly tender, tender, very
tender, etc.
b) quantitative:
i) discrete: one can list the possible values. For example,
Number of eggs laid by a Cardinal
Age (recorded in whole years)
Number of spots on a turtle
Usually, but not always, discrete data are integers (0,1,2,3...)
ii) continuous: one cannot list all possible values. For example:
Weight of person
Circumference of femur
Length of a snake
However, continuous data are only as accurate as the measuring device
used (we can't measure weight to infinite precision).
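The idea of a "logical order" for ordinal categories can be sketched in Python. This is only an illustration: the t-shirt sizes come from the example above, while the list of observations and the helper name `sort_sizes` are made up for this sketch.

```python
# T-shirt sizes from the notes, listed in their logical (ordinal) order.
SIZE_ORDER = ["small", "medium", "large", "x-large"]


def sort_sizes(sizes):
    """Sort sizes by their position in SIZE_ORDER, not alphabetically."""
    return sorted(sizes, key=SIZE_ORDER.index)


# Hypothetical observations, made up for illustration (not class data).
observed = ["large", "small", "x-large", "medium", "small"]

print(sorted(observed))      # alphabetical order, which is not the logical order
print(sort_sizes(observed))  # logical order: small first, x-large last
```

A non-ordinal variable such as blood type has no equivalent of `SIZE_ORDER`, which is exactly what makes it non-ordinal.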
Now, what else can you say about your data?
a) how many records do you have?
Definition of record (what your book calls “observational unit”, or “case”). Yes,
you should know all three terms.
Some examples:
Snake length (from above, where you’re measuring length).
Each snake has data associated with it, so each snake would be
considered an “observational unit”, case, or record.
Baby (suppose you're measuring birth weights).
Each baby would have an associated birth weight, so each baby is
a record.
Finally, the number of records that you have is your sample size, for which we use
the letter n in statistics:
n = 13 means you have 13 records.
See also the Samples section in the text on page 11, [10], {27}.
Finally, how can you organize your data?
Often, what you want to do is sort the data.
For example, who is the tallest person in the class?
The shortest?
The third tallest?
This last one can be difficult unless your data are sorted.
So we sort the data. One good way is with a stem and leaf plot (see example 2.8, p. 22,
[p. 18], {not in 4th edition(!)})
(illustrate using heights)
key: this tells us what our units are.
Stem and leaf plots give us a good first “look” at the overall data.
Here's another example of a stem and leaf plot:
Suppose we had the following data for heart rates in 19 mice:
592 585 608 599 635 694 591 559 628 721 538 651 718 505 635
633 558 586 579
First we sort the data:
505 538 558 559 579 585 586 591 592 599 608 628 633 635 635
651 694 718 721
Now we arrange the data into our stem and leaf plot:
50 | 5
51 |
52 |
53 | 8
54 |
55 | 89
56 |
57 | 9
58 | 56
59 | 129
60 | 8
61 |
62 | 8
63 | 355
64 |
65 | 1
66 |
67 |
68 |
69 | 4
70 |
71 | 8
72 | 1
Key: 50|5 = 505 bpm
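The construction above can be sketched in a few lines of Python. This is one possible approach, not the method used in the text; the function name `stem_and_leaf` is chosen just for this sketch. The stem is everything except the last digit, and the leaf is the last digit.

```python
from collections import defaultdict

# Heart rates (bpm) for the 19 mice, copied from the notes.
heart_rates = [592, 585, 608, 599, 635, 694, 591, 559, 628, 721, 538, 651,
               718, 505, 635, 633, 558, 586, 579]


def stem_and_leaf(values):
    """Return (stem, leaves) pairs; stem = value // 10, leaf = last digit."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    stems = range(min(leaves), max(leaves) + 1)  # keep the empty stems too
    return [(s, "".join(str(d) for d in leaves.get(s, []))) for s in stems]


plot = stem_and_leaf(heart_rates)
for stem, leaf_digits in plot:
    print(f"{stem} | {leaf_digits}")
print("Key: 50|5 = 505 bpm")
```

Sorting first means the leaves on each stem come out in increasing order, as in the plot above.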
This one's a bit “long”, in that it covers most of the page. Note also that it has lots
of empty stems (locations for which we have no data).
We could re-arrange this and do the following:
5 | 146689999
6 | 01334459
7 | 22
Key: 5|1 = 510 bpm
Note that in this example the data are rounded to the nearest 10 beats
and only two digits are given.
(The data are first rounded, with a final 5 rounding up, so 505 becomes 510
and 599 becomes 600; then they are put into the stem and leaf plot).
Both of the above are valid stem and leaf plots.
This does illustrate the importance of giving a key.
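The shorter plot can be reproduced with the sketch below. It assumes that a final 5 rounds up (505 becomes 510), which is what the leaves in the plot imply. Python's built-in `round` rounds halves to the nearest even digit instead, so it is deliberately not used here. The function names are made up for this sketch.

```python
# Heart rates (bpm) for the 19 mice, copied from the notes.
heart_rates = [592, 585, 608, 599, 635, 694, 591, 559, 628, 721, 538, 651,
               718, 505, 635, 633, 558, 586, 579]


def round_half_up_to_10(value):
    """Round a non-negative integer to the nearest 10, with 5 rounding up."""
    return (value + 5) // 10 * 10


def rounded_stem_and_leaf(values):
    """Stem = hundreds digit, leaf = tens digit of the rounded value."""
    rows = {}
    for v in sorted(round_half_up_to_10(x) for x in values):
        rows.setdefault(v // 100, []).append(v % 100 // 10)
    return {stem: "".join(map(str, leaves)) for stem, leaves in rows.items()}


short_plot = rounded_stem_and_leaf(heart_rates)
for stem, leaf_digits in short_plot.items():
    print(f"{stem} | {leaf_digits}")
print("Key: 5|1 = 510 bpm")
```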
Histograms and barplots:
Histograms:
These are used a lot, and are very useful in showing us frequency distributions. Used for
quantitative (continuous) data.
More on distributions soon.
Let's construct a histogram for our height data:
You divide your data into groups (e.g., 60 - 62, 63 - 65, 66 - 68, etc.), then count
the number of data points in each group, and finally plot this.
Let's try:
1) foot intervals
2) 1/2 foot intervals.
(These intervals are also called class widths).
Note that the histogram looks different depending on your class widths.
This can be a problem with histograms, although it's usually only serious
if your sample size is small.
There are actually formulas you can use that give you “ideal” class widths;
we won't use these, but they are built into lots of software (like R).
See also the examples in your text on pp. 17 & 18, [pp. 14 & 15], {31 - 34}.
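To see how the class width changes the picture, here is a small Python sketch. The class height data are not recorded in these notes, so it reuses the mouse heart rates from the stem and leaf example instead. The class widths (100 and 50 bpm), the starting value of 500, and the function name `class_counts` are all choices made for this illustration.

```python
from collections import Counter

# Heart rates (bpm) for the 19 mice, copied from the notes.
heart_rates = [592, 585, 608, 599, 635, 694, 591, 559, 628, 721, 538, 651,
               718, 505, 635, 633, 558, 586, 579]


def class_counts(values, width, start):
    """Count values in each class [lower, lower + width), beginning at start."""
    counts = Counter(start + (v - start) // width * width for v in values)
    lowers = range(start, max(values) + 1, width)
    return [(lower, counts.get(lower, 0)) for lower in lowers]


for width in (100, 50):
    print(f"class width = {width} bpm")
    for lower, count in class_counts(heart_rates, width, start=500):
        # One '#' per mouse gives a sideways text histogram.
        print(f"{lower}-{lower + width - 1} | {'#' * count}")
```

With a width of 100 the data fall into three classes; with a width of 50 there are five, and the shape looks noticeably different even though the data are the same.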
Barplots:
Barplots are similar to histograms, but are used for categorical data or discrete data.
Count the number of data points for each category (or discrete number), and plot
this.
If the data are discrete, then the x-axis is usually fixed:
Using the piglet example, we would plot 5 piglets/mother, then 6
piglets/mother, and so on (it wouldn't make sense to plot 8 piglets/mother
before 6 piglets/mother on the x-axis).
If the data are not discrete (i.e., categorical with no natural order), then a bar plot can have different shapes:
Suppose you have four different colors of flowers, and want to plot the
number of flowers of each color using a bar plot. Which color comes
first?
This can give you different bar plots for the same data (using R):
Histograms vs. barplots:
Histograms are used for continuous data.
Barplots are used for discrete or categorical data.
Technically, the bars in a histogram “touch” to indicate the data are continuous.
Similarly, the bars in a barplot do not touch.
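Returning to the flower color question: the sketch below shows how the same counts give differently shaped bar plots depending on which category comes first. The flower data are hypothetical (made up for this illustration, not from the text), and the plots are drawn as plain text in Python rather than in R.

```python
from collections import Counter

# Hypothetical flower colors, made up for illustration (not from the notes).
flowers = ["red", "white", "red", "yellow", "blue", "red", "white", "blue", "red"]

counts = Counter(flowers)


def text_barplot(counts, order):
    """Return one text bar per category, in the given order."""
    return [f"{name:<7}| {'#' * counts[name]}" for name in order]


# Two equally valid orderings of the same four categories.
alphabetical = sorted(counts)
by_frequency = sorted(counts, key=lambda name: (-counts[name], name))

print("\n".join(text_barplot(counts, alphabetical)))
print()
print("\n".join(text_barplot(counts, by_frequency)))
```

Neither ordering is wrong, because color has no natural order; with discrete data such as piglets per mother there is only one sensible order.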
Shapes of distributions:
First, what do we mean by distribution?
A distribution describes what our data look like:
How many short people are there? How many tall people? How many
“average” people (note: we have not defined average yet)?
We visualize this using a graph. A histogram is a good start, but usually
we overlay this with a smooth line.
Some definitions:
Mode: the highest point (peak) of a distribution, that is, the value (or set of
values) that has the highest frequency.
Tails: the ends of the distribution. Usually these are quite skinny (close to the
x-axis), but they don't have to be.
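For raw data, finding the mode amounts to counting how often each value occurs. A minimal Python sketch, reusing the mouse heart rates from the stem and leaf example (the function name `modes` is made up here):

```python
from collections import Counter

# Heart rates (bpm) for the 19 mice, copied from the notes.
heart_rates = [592, 585, 608, 599, 635, 694, 591, 559, 628, 721, 538, 651,
               718, 505, 635, 633, 558, 586, 579]


def modes(values):
    """Return every value that ties for the highest frequency, sorted."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


print(modes(heart_rates))
```

The function returns a list because several values can tie for the highest frequency, which matches the "set of values" wording in the definition.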
So, what kind of shapes can we get?
See fig. 2.2, p. 26, [fig. 2.21, p. 21], {fig. 2.2.5, p. 36}, in text.
Illustrate:
Symmetric, bell shape. Very common in biological data.
See also the yellow curve on the class web page.
This shape is also very important to statistics.
Symmetric, with long tails. This can create problems in analyses.
See, for example, the red curve on the class web page.
Skewed to right/left.
Note: skewed right means “long tail on right”.
Exponential
This does have two tails; the tail on one side is just very fat.
Bimodal
A distribution with two peaks, not necessarily the same height (if
one peak is higher, it is the one that's the mode).
Uniform
Every outcome is equally likely (e.g., just as many short as tall as
medium people).
Symmetric, U shaped.
Oddly, this is usually easier to deal with than a long tailed
distribution (the tails usually stop abruptly, despite being fat).
One of the first things you should do in any statistical analysis is figure out what
kind of distribution your sample has. This can determine what kind of analysis you
can do.
We will learn more sophisticated techniques for evaluating distributions later.
Distributions will come back again and again - we're not done with the topic!
Notation
Capital & lower case letters:
Y
represents a variable. For instance, birth weight.
“Y” says nothing about an actual “value”.
Y is considered to be random (we don't know what it is, it just represents,
for example, “birth weight”)
y
represents an actual value for Y. For instance, 9 pounds, 13 ounces in
birth weight of babies.
y is no longer random - we know what it is (9 pounds, 13 ounces)
y_i
represents the observation in the “i”th place.
For example, you collect the following data:
14, 12, 16, 23, 18, 17
then we have:
y_1 = 14
y_2 = 12
y_3 = 16
y_4 = 23
y_5 = 18
y_6 = 17
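If you store these observations in a program, keep in mind that the subscript here starts at 1, while many languages (Python included) count list positions from 0. A small sketch, with a helper name chosen just for this illustration:

```python
# The six observations from the notes.
y = [14, 12, 16, 23, 18, 17]


def y_i(i):
    """Return the i-th observation using the notes' 1-based numbering."""
    if not 1 <= i <= len(y):
        raise IndexError(f"i must be between 1 and {len(y)}")
    return y[i - 1]  # Python lists start at position 0


n = len(y)  # sample size
print(n, y_i(1), y_i(4))
```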
Now, one more symbol:
Capital Sigma, Σ, is used as a symbol which means “Sum”.
Essentially it means add up the stuff after the symbol.
Depending on your edition, the text isn't very good about explaining this correctly.
Here’s how it’s used, using the above numbers as an example:
$\sum_{i=1}^{6} y_i = y_1 + y_2 + y_3 + y_4 + y_5 + y_6$
or in our case:
$\sum_{i=1}^{6} y_i = 14 + 12 + 16 + 23 + 18 + 17 = 100$
Some editions omit the numbers/symbols above and below the sigma.
Most often, this doesn’t make much of a difference, but at times this does
become important.
You should get in the habit of using these together with the Sigma.
For example, suppose we only want to add the first three numbers:
$\sum_{i=1}^{3} y_i = 14 + 12 + 16 = 42$
Here are a couple of other ways this can be used:
$\sum_{i=1}^{4} i = 10$
or:
$\sum_{i=4}^{5} 3 \times i = 27 = 3 \sum_{i=4}^{5} i$
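The Sigma examples in this section can be checked with a short Python sketch. The helper `sigma(lower, upper, term)` is a name invented for this illustration: the number below the Sigma becomes `lower`, the number above becomes `upper`, and the "stuff after the symbol" becomes `term`.

```python
# The six observations from the notes.
y = [14, 12, 16, 23, 18, 17]


def y_i(i):
    """Return the i-th observation, counting from 1 as in the notes."""
    return y[i - 1]


def sigma(lower, upper, term):
    """Sum term(i) for i = lower, lower + 1, ..., upper (both ends included)."""
    return sum(term(i) for i in range(lower, upper + 1))


total = sigma(1, 6, y_i)                   # all six observations
first_three = sigma(1, 3, y_i)             # only the first three
index_sum = sigma(1, 4, lambda i: i)       # 1 + 2 + 3 + 4
scaled = sigma(4, 5, lambda i: 3 * i)      # 3*4 + 3*5
factored = 3 * sigma(4, 5, lambda i: i)    # 3 * (4 + 5)

print(total, first_three, index_sum, scaled, factored)
```

Changing only `lower` and `upper` changes which terms get added, which is why it pays to always write the limits with the Sigma.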
We’ll get to see some more complicated examples of how this works when we do
means and variances and sums of squares next class.
If you want, you can try the following:
1) $\sum_{i=3}^{5} y_i$
2) $\sum_{i=1}^{5} y_{i+1}$
3) $\sum_{i=1}^{5} 15 y_i$
4) $\sum_{i=1}^{6} (15 - y_i)$
Answers:
1) 57
2) 86
3) 1245
4) -10
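If your answers don't match, it may help to write each sum out term by term. The expansions below use the six observations from the notation example (14, 12, 16, 23, 18, 17); the subscript i+1 in exercise 2 is read from the damaged original and agrees with the stated answer of 86.

```latex
\begin{align*}
\sum_{i=3}^{5} y_i &= y_3 + y_4 + y_5 = 16 + 23 + 18 = 57 \\
\sum_{i=1}^{5} y_{i+1} &= y_2 + y_3 + y_4 + y_5 + y_6 = 12 + 16 + 23 + 18 + 17 = 86 \\
\sum_{i=1}^{5} 15 y_i &= 15 \sum_{i=1}^{5} y_i = 15 \times 83 = 1245 \\
\sum_{i=1}^{6} (15 - y_i) &= 6 \times 15 - \sum_{i=1}^{6} y_i = 90 - 100 = -10
\end{align*}
```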