SP17 Lecture Notes 1 - Basic Concepts and Types of Studies

Lecture Notes 1:
Terminology and Statistical Studies
Outline:
-What is statistics?
-What is data?
-Variables
-Samples
-Inference
-Parameters & statistics
-Sampling and bias
-Types of studies
-Confounding variables
-Blinding & double blinding
1
What Is Statistics?
• Statistics is the science of how data is collected,
analyzed, and interpreted.
• Data is information (this is an intentionally vague
definition).
• Typically data takes the form of observed
measurements (e.g. height, temperature) or
descriptions (e.g. dog, cat, blue, female)
2
• Less dry definition: statistics as a discipline concerns
how we use data to say things about the real world. We
want to turn raw information into useful knowledge!
• We might want to summarize data. This can be done
visually (“here’s a graph of last week’s temperatures”) or
with numbers (“the average temperature last week was
42 degrees Fahrenheit”).
• We might want to use data to learn about a quantity that
is unknown or practically unknowable.
For instance, how many trees are there in Rocky
Mountain National Park? No one knows! But maybe we
can gather some data and get an approximate idea.
3
Wise words:
“I believe that it would be worth trying to learn
something about the world even if in trying to do so we
should merely learn that we do not know much. This
state of learned ignorance might be a help in many of
our troubles. It might be well for all of us to remember
that, while differing widely in the various little bits we
know, in our infinite ignorance we are all equal.”
-Karl Popper
“Knowledge Without Authority”, 1960
4
• Uncertainty is rampant! We use statistics to
answer questions about things that are unknown
(e.g. the number of trees in Rocky Mountain National
Park), but just as importantly, we use statistics to
get an idea of how sure we are of our answers.
Sometimes it is useful to know that you don’t know!
• Uncertainty will be a theme of this class. “We can’t
know for sure” will be the answer to a lot of
questions. Statistical tools are useful for making
the most out of limited information.
5
Some statistical statements
• “I sleep seven hours per night, on average”
• “Cam Newton set NFL rookie records for passing yards in a game,
yards in a season, yards in back to back games, and touchdowns
in a season”
• “48% of likely voters intend to vote for the incumbent, with a
margin of error of +/- 3%”
• “Students who attend class regularly tend to get higher grades
than those who do not”
6
Terminology
• The next group of slides will cover commonly used
terminology found in the discipline of statistics.
• Most of these terms are best understood in the context
of a research study.
• For illustration, we will suppose that a study is
being done to estimate the average height of all
adult females living in the U.S.
7
Variables
• Variables are items (usually represented by letters or symbols) of
interest which can take on different values.
• You can usually tell what a variable of interest is by looking at data
and noting the type of measurement being taken.
• For example, if we have collected data on the heights of adult females
living in the US, our variable is simply height.
• If we denote height as “X”, our data might look like:
X1    X2    X3    X4    X5    X6    X7
65”   70”   61”   67”   64”   67”   63”
8
Population
• A population is the entire overall group we are
interested in.
• So, if we are trying to get an idea of the average
height of US adult females, then the population of
interest is all US adult females. Sometimes this is
called the target population.
• Populations can be large or small. Usually they are
large.
9
Sample
• A sample is a subset of the entire population that we collect data on. The variable(s) of
interest is/are measured on each member of the sample.
• A single member of a sample is called an observation.
• We take samples because the entire population of interest is usually not available to us.
• For instance, if we are studying the height of US adult females, we will collect height data
from a sample. We are not going to measure the height of every woman in the United States.
10
Census Data
• In the rare case that measurements are obtained from every
member of a population, then we say we have a census.
• Every 10 years, the federal government runs a census, and
the goal is to collect information on every single
person in the country.
• If our population is small, we may have census data. If it is
large, we usually don’t. We aren’t going to measure the
height of every woman in the US, or count all the trees in
Rocky Mountain National Park.
11
Two fields of statistics
• There are, broadly speaking, two major fields of
statistics: descriptive and inferential.
• Descriptive statistics involve describing a dataset:
we could make a graph of it, or tell you its average
and how spread out it is. We could tell you any
interesting features about it.
• However, in descriptive statistics, we limit ourselves
to describing the data itself. We do not generalize
facts about the dataset to a larger group.
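As a small illustration of describing a dataset with numbers, the sketch below summarizes the seven heights listed on the Variables slide. It uses Python's standard `statistics` module; the variable names are just illustrative choices.

```python
import statistics

# Heights in inches, taken from the "Variables" slide.
heights = [65, 70, 61, 67, 64, 67, 63]

mean_height = statistics.mean(heights)      # center of the data
median_height = statistics.median(heights)  # middle value when sorted
sd_height = statistics.stdev(heights)       # sample standard deviation (spread)

print(f"mean   = {mean_height:.2f} inches")
print(f"median = {median_height} inches")
print(f"sd     = {sd_height:.2f} inches")
```

Note that these numbers describe only these seven observations. Nothing here says anything about a larger group.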
12
Inference
• If we generalize from a sample to a population, we are
performing inferential statistics.
• These sorts of generalizations are called inferences, and
the investigator is said to “draw inference” on the
population, using the sample.
• e.g. we can take a sample of 100 women and use their
average height to draw inference on the average height for
the entire country.
• Statistical inferences always contain uncertainty. Our
estimate for average height may be wrong!
13
Parameters
• A parameter is a numeric characteristic pertaining
to a population. In our example, the parameter of
interest is the average height of all US adult females.
• Most often parameters cannot be determined
because we do not have census data.
• However, we can still draw inference on a parameter
with a statistic. We just can’t know the parameter’s
value with certainty.
14
Statistics
• A statistic is any number you calculate using data.
We use statistics to estimate parameters.
• In our example, if we use the average height of the
women in our sample to estimate the average height
of women in the country, then this sample average is
our statistic.
• But remember, the statistic is just an estimate! We
can’t ever know the true parameter value unless we
are able to measure the entire population.
15
Putting it all together
• So, in our example, we can take a sample of US women from the
population of all US women and measure their height (the
variable of interest), which gives us our data. Each individual
woman in our data set is an observation. We can then calculate
the average height of the women in our sample, which is our
statistic. We use this statistic to draw inference about the
average height of all women in the country, which is our
parameter.
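The vocabulary above can be sketched in a short simulation. Everything here is hypothetical: the population is invented (100,000 heights with a made-up average of 64 inches and spread of 2.5 inches), which is the only reason we can compute the parameter at all. In a real study the parameter would be unknown.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 heights in inches.
# The mean of 64 and spread of 2.5 are made-up values, not real data.
population = [rng.gauss(64, 2.5) for _ in range(100_000)]

# Parameter: the population average. Usually unknown in practice;
# we can compute it here only because we invented the population.
parameter = statistics.mean(population)

# Sample: 100 observations chosen at random from the population.
sample = rng.sample(population, 100)

# Statistic: the sample average, used to estimate the parameter.
statistic = statistics.mean(sample)

print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic (sample mean):     {statistic:.2f}")
```

The two printed values will usually be close but not identical, which is the uncertainty that comes with inference.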
16
Major concerns in statistical
inference
• We have to be careful when generalizing from a sample
to a population. The descriptive characteristics of your
sample may not accurately reflect the characteristics of
the population you are studying.
• Two important questions in statistical inference are:
-Is my sample big enough?
-Is my sample representative of the population of
interest?
17
Small sample sizes
• If our sample is small, we can’t confidently draw
inference to the population.
• For example, if we measure the height of three women,
their average height might not be a good estimate for
the true average height of all women. Maybe we
happened to sample three tall women. Or three short
women.
• Smaller samples mean greater uncertainty when
drawing inference to a population.
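One way to see this is to take many samples of each size and look at how much the sample averages bounce around. The sketch below does that with an invented population of heights (the 64-inch average and 2.5-inch spread are assumptions, not real data). The helper name `sample_means` is hypothetical.

```python
import random
import statistics

rng = random.Random(2)
population = [rng.gauss(64, 2.5) for _ in range(100_000)]  # invented heights

def sample_means(n, repeats=1000):
    """Draw `repeats` random samples of size n and return each sample's mean."""
    return [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]

# Spread (standard deviation) of the sample averages for each sample size.
spread_n3 = statistics.stdev(sample_means(3))
spread_n100 = statistics.stdev(sample_means(100))

print(f"spread of averages, samples of 3:   {spread_n3:.2f} inches")
print(f"spread of averages, samples of 100: {spread_n100:.2f} inches")
```

Averages from samples of three vary much more from sample to sample than averages from samples of 100.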
18
Bias
• A statistic is said to be biased if it is created in such a way that we would
expect it to differ systematically from the population parameter that it is
meant to estimate.
• For example, if our sample consists entirely of the CSU volleyball team,
then we will probably end up considerably overestimating the true average
height of all US women.
• This is an example of sampling bias, which is a type of bias that arises
when your sample is not representative of the population of interest.
19
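The volleyball example can be imitated with a rough simulation. The population below is invented, and "the volleyball team" is stood in for by people taller than 70 inches. That cutoff is purely an assumption for illustration.

```python
import random
import statistics

rng = random.Random(3)
population = [rng.gauss(64, 2.5) for _ in range(100_000)]  # invented heights
population_mean = statistics.mean(population)

# Stand-in for "the volleyball team": only people taller than 70 inches
# can end up in this sample. The 70-inch cutoff is an assumption.
tall_people = [h for h in population if h > 70]
biased_sample = rng.sample(tall_people, 15)

# Simple random sample of the same size from the whole population.
random_sample = rng.sample(population, 15)

print(f"population mean:    {population_mean:.2f}")
print(f"biased sample mean: {statistics.mean(biased_sample):.2f}")
print(f"random sample mean: {statistics.mean(random_sample):.2f}")
```

The biased sample overestimates the population average no matter how many times we repeat it, because the error comes from how the sample was chosen rather than from chance.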
Bias
• Another example of sampling bias comes from political polling. Polling firms
used to contact potential voters using only telephones on land lines.
However, with the advent of cell phones, many people do not have land lines.
• This can lead to sampling bias because people who have land lines differ
systematically from people who only use cell phones with respect to age.
Younger people tend to use only cell phones, and younger people tend to
vote differently than older people.
• If a polling firm only calls land lines, its sample will be biased toward the
opinions of older voters.
20
Bias
• In addition to sampling bias, we will consider two
other common forms of bias: self-selection bias and
non-response bias.
• Self-selection bias can occur when people choose if
they want to be included in a sample. If the reason for
their choosing to be in the sample is related to what is
being measured, bias can result.
21
Self-selection bias
• For example, during the last Republican presidential primary
debates, many websites posted polls where people could vote on
who won the debate.
• In nearly every poll, Ron Paul came out as the winner. But in the
actual primary elections, Ron Paul rarely came close to winning.
• This is because Ron Paul’s supporters were well organized on the
internet, and made coordinated efforts to vote in these online polls.
22
Self-selection bias
• Self-selection bias also occurs in online reviews on websites like
amazon.com or yelp.com.
• Often you will see that most reviews are either 5 stars or 1 star.
Perhaps this means that most people have sharply differing views
regarding the thing being reviewed. But perhaps self-selection bias
is at play.
• Who is most motivated to write a review? Those who feel lukewarm,
or those who feel passionate?
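A quick simulation shows how self-selection alone could produce a pile of 1-star and 5-star reviews. All of the numbers are invented: most simulated customers feel lukewarm, but customers with extreme opinions are assumed to be ten times as likely to post a review.

```python
import random
from collections import Counter

rng = random.Random(4)

# Hypothetical true opinions of 10,000 customers: most feel lukewarm.
# These weights are invented for illustration.
stars = [1, 2, 3, 4, 5]
true_weights = [5, 15, 40, 30, 10]
opinions = rng.choices(stars, weights=true_weights, k=10_000)

# Assumed chance of bothering to write a review, by rating.
review_prob = {1: 0.50, 2: 0.05, 3: 0.05, 4: 0.05, 5: 0.50}
posted = [s for s in opinions if rng.random() < review_prob[s]]

def extreme_share(ratings):
    """Fraction of ratings that are 1 star or 5 stars."""
    counts = Counter(ratings)
    return (counts[1] + counts[5]) / len(ratings)

print(f"extreme opinions among all customers:  {extreme_share(opinions):.0%}")
print(f"extreme opinions among posted reviews: {extreme_share(posted):.0%}")
```

Under these assumptions the posted reviews look far more polarized than the customers actually are.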
23
Non-response bias
• Non-response bias can occur when certain types of
respondents are more or less likely to answer a survey.
• For instance, if a company’s human resources team is
trying to determine what proportion of their workforce
feel as though they are overworked, sending out a survey
may result in an underestimate of this proportion. After
all, people who are overworked are less likely to take the
time to fill out a survey!
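The overworked-employee example can be sketched the same way. The workforce below is hypothetical: 40% of employees are assumed to feel overworked, and they are assumed to answer the survey less often (20% of the time) than everyone else (60% of the time).

```python
import random

rng = random.Random(5)

# Hypothetical workforce of 5,000; True means the employee feels overworked.
# The 40% rate and both response rates below are assumptions.
workforce = [rng.random() < 0.40 for _ in range(5_000)]

def responds(overworked):
    """Overworked employees are assumed less likely to fill out the survey."""
    return rng.random() < (0.20 if overworked else 0.60)

# Keep only the employees who actually return the survey.
responses = [person for person in workforce if responds(person)]

true_proportion = sum(workforce) / len(workforce)
survey_proportion = sum(responses) / len(responses)

print(f"true proportion overworked:   {true_proportion:.0%}")
print(f"survey estimate (responders): {survey_proportion:.0%}")
```

The survey estimate comes out well below the true proportion, even though nobody answered dishonestly.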
24
Simple Random Sample
• To mitigate bias, samples should be collected
randomly.
• A simple random sample (SRS) is a sample of the
population where every unit has an equal opportunity
to be selected, as in drawing names from a hat.
• Valid inferences can be drawn using data obtained via
a SRS. There are more complicated sampling
methods that also yield valid inferences, but we will
not cover them in this class.
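Drawing names from a hat is easy to imitate on a computer. The sketch below assumes we have a complete list of the population (the list of 1,000 made-up names is hypothetical) and uses `random.sample` from Python's standard library, which selects without replacement.

```python
import random

rng = random.Random(6)

# Hypothetical sampling frame: a list of every unit in the population.
frame = [f"person_{i}" for i in range(1, 1001)]

# random.sample draws without replacement, and every unit is equally
# likely to be chosen, like pulling names from a hat.
srs = rng.sample(frame, 25)

print(srs[:5])
```

The hard part in practice is usually not the random draw but getting a complete list of the population to draw from.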
25
Why sample randomly?
• Self-selection bias can be overcome by making sure to select the
members of your sample randomly. If your sample is taken
randomly, then people do not get to choose to be a part of it.
• Sampling bias can also be overcome via random sampling from the
whole population. For instance, if we take a SRS of US adult
women, chances are that not all of them will be volleyball players.
• Sometimes taking a SRS is difficult. How does a polling firm make
sure that every likely voter has an equal chance of being sampled?
26
Why sample randomly?
• Random samples are more likely to be representative of the
population of interest than non-random samples.
• This does not mean, however, that a random sample is necessarily
representative of the population (maybe, just by chance, we
sampled lots of volleyball players!). It just means that we are not
introducing bias via the way we collect the sample.
• In other words, we might still get a non-representative sample. But
if we do it will have been by random chance, rather than by design.
27
Studies and Experiments
• When we talk about using data to learn something
about the world, we are usually talking about
performing a study or an experiment.
• These come in two major flavors: observational
studies and controlled experiments.
28
Observational Study
• In observational studies variable values are
observed and recorded from already existing data.
• e.g. a survey of a hospital’s death records where
the variable of interest might be “Cause of Death”
will be an observational study.
• Studies involving anything from the past are
necessarily observational.
29
Controlled Experiment
• In a controlled experiment the researcher gets to
assign members of the study to different groups,
which are then subjected to different experimental
conditions (or “treatments”)
• e.g. if we want to determine if a medical procedure
is effective, we can randomly assign people to
either undergo the procedure (the treatment
group) or not (the control group). We can then see
if we observe differences between the groups with
respect to what the procedure was designed to
treat.
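Random assignment can be as simple as shuffling the list of participants and splitting it in half. This is a minimal sketch with 40 hypothetical participant IDs; real studies often use more elaborate schemes.

```python
import random

rng = random.Random(7)

participants = [f"participant_{i}" for i in range(1, 41)]  # hypothetical IDs

shuffled = participants[:]   # copy so the original list is untouched
rng.shuffle(shuffled)

half = len(shuffled) // 2
treatment_group = shuffled[:half]   # undergo the procedure
control_group = shuffled[half:]     # do not undergo the procedure

print(len(treatment_group), len(control_group))
```

Because chance alone decides who lands in which group, neither the researcher nor the participants can steer particular kinds of people toward the treatment.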
30
Controlled Experiment
• In a controlled experiment, the researcher can try to
“control” for factors that could cause bias.
• The idea is to ensure that the treatment and control
groups are as similar as possible to one another
before the experiment, so that if differences are
observed afterward, they are most plausibly due to the
treatment and not some outside factor.
31
Controlled experiment vs
observational study
• For example, if I want to determine whether weight
loss pills are effective, I could use the following
observational method: I could get a sample of people
who are trying to lose weight, split them into those
who have used a pill and people who have not, and
record how much each person’s weight has changed in
the last 6 months.
32
Controlled Experiment
• Let’s suppose I observe that the people who took a pill
lost more weight than the people who didn’t. This may
seem like solid evidence that the pill works, but it is
not.
• Problem: I have not controlled for other possible
influences on a person’s weight.
• Maybe those who are taking pills are following a
healthier diet and exercising more than the people who
are not taking weight loss pills. In this case, the
observed difference in weight loss might not be due to
the pill, but to other outside factors.
33
Controlled Experiment
• To control for these other factors, I should randomly
assign some participants to take a pill and others to
not take a pill. Perhaps I can also either make sure
they are on a similar diet and exercise regimen, or at
least collect data on their diet and exercise regimens.
• If I still see that people who took the pill lost more
weight, then I might have evidence that the pill caused
the weight loss.
• Controlling for other factors eliminates them as
possible explanations for any observed differences
between the two groups.
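The weight loss pill example can be sketched as a simulation in which the pill does nothing at all. Every number is invented: exercise is assumed to cause 5 pounds of loss on average, and in the observational version people who exercise are assumed to be more likely to choose the pill. The function names are hypothetical.

```python
import random
import statistics

rng = random.Random(8)

def weight_loss(exercises):
    """Pounds lost in 6 months. The pill has NO effect in this simulation;
    only exercise matters (5 lb on average), plus random noise."""
    return (5.0 if exercises else 0.0) + rng.gauss(0, 2)

def mean_difference(takes_pill_rule, n=2000):
    """Average loss of pill takers minus average loss of non-takers."""
    pill, no_pill = [], []
    for _ in range(n):
        exercises = rng.random() < 0.5
        takes_pill = takes_pill_rule(exercises)
        (pill if takes_pill else no_pill).append(weight_loss(exercises))
    return statistics.mean(pill) - statistics.mean(no_pill)

# Observational: people who exercise are assumed more likely to choose the pill.
observational_diff = mean_difference(lambda ex: rng.random() < (0.8 if ex else 0.2))

# Controlled: a coin flip decides who takes the pill, regardless of exercise.
experimental_diff = mean_difference(lambda ex: rng.random() < 0.5)

print(f"observational difference: {observational_diff:.2f} lb")
print(f"experimental difference:  {experimental_diff:.2f} lb")
```

The observational comparison makes a useless pill look effective, while random assignment gives a difference near zero because exercisers end up spread evenly across both groups.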
34
Side note: the placebo effect
• If we are studying the effectiveness of a pill, we
should also compare it to a placebo. It is often the
case that the act of receiving a treatment that a
person believes will be beneficial will itself result in a
beneficial effect, regardless of whether or not the
treatment is otherwise effective.
• In controlled experiments, the group that gets a
placebo might be called the control group.
Sometimes there is a “no treatment” control group in
addition to a placebo control group.
35
Controlled Experiment
• In a controlled experiment, we can be reasonably
comfortable inferring that the treatment applied to
one group but not to the other caused whatever
difference we observe between the two.
• But otherwise we must remember that
correlation does not imply causation!
• Variables are correlated if they “move together”, i.e.
when one variable changes, the other one tends to
change as well.
• Examples: temperature is correlated with time of
day; human height is correlated with weight.
36
Correlation and causation
• If changing one variable causes a change in another, they will be correlated. But the converse
is not necessarily true.
• If two variables are correlated, it is not necessarily the case that one has caused the other to
change.
• For instance, there is a statistical correlation between marriage and income. People who are
married tend to make more money than people who are not married. Does this suggest that
getting married causes an increase in income, or can you think of another explanation?
37
Confounding variable
• A confounding variable is a variable that helps
explain the data but is not accounted for in the study.
Typically it is related to both of the variables being
compared, so it can make them appear linked.
• Another example: the number of shark attacks during
a given month is positively correlated with the number
of ice cream cones sold in that month.
• Can you think of another variable that might be
related to ice cream sales and shark attacks?
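One candidate answer is temperature. The sketch below is a made-up illustration, not real data: temperature drives both simulated variables, and ice cream sales and shark attacks never influence each other. The `pearson` helper is written out by hand and computes the usual correlation coefficient.

```python
import math
import random

rng = random.Random(9)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# All numbers below are invented. Temperature drives both variables;
# ice cream sales and shark attacks never influence each other.
temps = [rng.uniform(30, 90) for _ in range(2000)]        # simulated months
ice_cream = [10 * t + rng.gauss(0, 50) for t in temps]     # cones sold
sharks = [0.1 * t + rng.gauss(0, 1) for t in temps]        # attacks (idealized)

overall_r = pearson(ice_cream, sharks)

# Compare only months with nearly the same temperature (60 to 65 degrees).
band = [(i, s) for t, i, s in zip(temps, ice_cream, sharks) if 60 <= t <= 65]
within_band_r = pearson([i for i, _ in band], [s for _, s in band])

print(f"correlation across all months:          {overall_r:.2f}")
print(f"correlation at similar temperatures:    {within_band_r:.2f}")
```

Across all months the two variables are strongly correlated, but once temperature is held roughly constant the correlation mostly disappears.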
38
More fun with confounding variables
• The size of a child’s vocabulary is correlated with
the number of cavities he or she has. Learning how to
speak causes cavities!
• Retail stores report higher sales volumes when their
bathrooms are dirty. Better not clean the
bathrooms or it will hurt sales!
39
More fun with confounding variables
• This is something to really look out for in politics.
Suppose the unemployment rate has increased during
a certain presidential administration. Did that
president cause the unemployment rate to increase?
• We see “correlation = causation” type reasoning in
debates about gun control, taxation, immigration,
education, etc. But how can we know that a certain
government policy caused a certain outcome? It is
difficult to do!
40
Blinding
• Blinding and double blinding are methods that are
used to try to eliminate bias that occurs when
researchers and subjects are aware of which group is
which in a study.
• To motivate this concept, consider the “Pepsi
challenge”. This is a taste test that was used by the
Pepsi corporation, in which participants were asked
to compare a cup of Pepsi to a cup of Coca-Cola.
41
Blinding
• If the participants are handed cups labeled “Pepsi” and
“Coke”, then their responses may be biased because
they may have preconceived notions of which product
they prefer.
• In order to eliminate this potential source of bias, the
study should be blinded.
• A blinded study is one in which participants do not
know which group they are in, or which treatment is
which. In the context of a taste test, the study is
blinded if the cups are not labeled and the participants
do not know which cup contains which product.
42
Double Blinding
• However, even if the study is blinded, there could still
be a problem. If the experimenter knows which
product is in which cup, then he or she could subtly
(and possibly unintentionally) influence the outcome of
the experiment.
• To avoid bias being introduced by the experimenter,
the study can be double blinded. A double blinded
study is one in which neither the participant nor the
experimenter knows which group is which. In our
example, the experimenter should not know which cup
contains which product.
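One possible way to set up a double blinded taste test is to label the cups with codes and keep the key away from the experimenter. The sketch below is only an illustration of that idea: the participant IDs, the cup codes "A" and "B", and the names `secret_key` and `visible_sheet` are all hypothetical.

```python
import random

rng = random.Random(10)

participants = [f"participant_{i}" for i in range(1, 6)]  # hypothetical IDs

# Secret key: for each participant, which coded cup ("A" or "B") holds
# which product. Someone other than the experimenter would keep this.
secret_key = {}
for person in participants:
    products = ["Pepsi", "Coke"]
    rng.shuffle(products)                 # random order for each participant
    secret_key[person] = dict(zip(["A", "B"], products))

# What the experimenter and participant see: cup codes only.
visible_sheet = {person: ["A", "B"] for person in participants}

print(visible_sheet["participant_1"])
```

Preferences are recorded by cup code, and the key is consulted only after all responses are in.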
43
Conclusion
• This has been a quick overview of the major concepts
we will be invoking throughout the semester.
• In the next set of notes, we will look at common ways
of displaying data.
44