Statistics 101 for Market Research
September 2014
Statistics 101
• Sampling
• Margin of Error Statements
• Weighting
• Statistical Significance Testing: t-Test
Sampling
Sampling: Source
River
Panel
River Sample Selection
River sample is a blend of whatever sample is available at the moment you dip in to find
respondents for your study. While you can make river sample look good via
quotas, we cannot control who is included. It is typically not the highest-quality
sample.
Panel Sample Selection
Gen Pop Sample
Targeted Sample
Pre-screening through Self–Identification
Statistics: Probability Sample
A probability sample is one where every element in the
population has a known/non-zero probability of being
selected.
• A frame that includes every element in the population
• Known probability of selection: a random selection process in
which each element's chance of selection equals a
predetermined probability
Statistical inference (drawing conclusions about the
population based on sample data) can only be made through
the use of probability samples.
Sample Selection
Population
Panel
Probability Sample
Panel Sample
Statistics: Stratified Sample
Often the population can be divided into strata, and the sampling process
can be repeated within each stratum separately. This is called stratified
sampling.
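A minimal Python sketch of the idea, assuming a hypothetical frame stored as (id, stratum) pairs. The function name `stratified_sample`, the strata labels and the counts are illustrative only and do not come from any panel system.

```python
import random
from collections import defaultdict

def stratified_sample(frame, targets, seed=None):
    """Draw a simple random sample of the target size within each stratum.

    frame   : list of (element_id, stratum) pairs (hypothetical layout)
    targets : dict mapping stratum -> number of elements to draw
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for element_id, stratum in frame:
        by_stratum[stratum].append(element_id)

    sample = {}
    for stratum, n in targets.items():
        members = by_stratum[stratum]
        if n > len(members):
            raise ValueError(f"stratum {stratum!r} has only {len(members)} elements")
        # Sampling without replacement: every element in the stratum has
        # the same known selection probability, n / len(members).
        sample[stratum] = rng.sample(members, n)
    return sample

# Hypothetical frame: 600 "Male 18-34" and 400 "Female 18-34" panelists.
frame = [(i, "Male 18-34") for i in range(600)]
frame += [(i, "Female 18-34") for i in range(600, 1000)]

picked = stratified_sample(frame, {"Male 18-34": 30, "Female 18-34": 20}, seed=1)
print({stratum: len(ids) for stratum, ids in picked.items()})
```

Because the draw is repeated separately per stratum, the sample hits each stratum's target exactly, which a single random draw from the whole frame would only do on average.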
Panel Sample Balancing
Panel balancing is essentially a stratified random sample with
strata made up of combinations of many demographic
variables.
US Gen Pop – Age/Gender/Region
Universe: AGEP between 18 and 99
Weight used: PWGTP
DataSet(s) selected: 2009 population
Source: ACS Public Use Microdata Sample
Total number of completes: n=1,000

               Northeast  Midwest  South  West  Total
Male 18-34            27       34     58    39    158
Male 35-54            34       40     67    43    185
Male 55+              27       32     53    32    144
Female 18-34          26       33     56    36    151
Female 35-54          35       41     69    42    187
Female 55+            34       39     65    37    175
Total                184      219    367   230  1,000
Response rate adjusted sample balancing
Within the national panels, we have a pretty good idea of how likely any
given stratum in the balancing target matrix is to respond to the survey
invite.
A sampling analyst may choose to use this information by adjusting the
number of invites for each stratum according to its response rate. This brings
the ending sample (i.e. the number of completed surveys) within each stratum
closer to the census targets.
For the commonly used national sample balancing matrices, see
\\Netapp01\projects\NATIONAL_PANELS\01_NATIONAL_PANELS\NatPanel_PROFILING\Matrices
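The adjustment can be sketched in Python as dividing each stratum's target completes by its expected response rate. The two targets below are the Northeast cells for Male 18-34 and Female 55+ from the balancing matrix. The response rates and the function name `invites_needed` are hypothetical, chosen only for illustration.

```python
import math

def invites_needed(target_completes, response_rates):
    """Invitations per stratum = target completes / expected response rate,
    rounded up so the target is not undershot on average."""
    invites = {}
    for stratum, target in target_completes.items():
        rate = response_rates[stratum]
        if not 0 < rate <= 1:
            raise ValueError(f"response rate for {stratum!r} must be in (0, 1]")
        invites[stratum] = math.ceil(target / rate)
    return invites

# Targets taken from the Northeast column of the balancing matrix.
targets = {"Male 18-34": 27, "Female 55+": 34}
# Hypothetical response rates (assumption, not measured values).
rates = {"Male 18-34": 0.125, "Female 55+": 0.25}

print(invites_needed(targets, rates))
# {'Male 18-34': 216, 'Female 55+': 136}
```

A stratum that responds half as often needs twice as many invites to end up with the same number of completes.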
We are rarely interested in true ‘gen pop’
Not really ‘gen pop’ anymore: it’s ‘Gen Pop who…’
Gen Pop
Soup buyers
…aged 24-54
…moms
• Sample pull is GenPop
• In-study screening / lots of DQ
• Keep quota to a minimum and “real”
• Good weighting options
Example: Do you really need this?
Is it right? Probably NOT.
Quotas
                 Kids 13-18 @ home
Mom's Age        Yes        No
24-39            n=100      n=100
40-54            n=100      n=100
Targeted Sample
Or we can sample only among self-identified Soup buyers
Soup buyers
…aged 24-54
…moms
• Sample pull is from self-identified Soup buyers
• In-study screening, less DQ
• Use quota to make sample “look” good.
• Tricky to weight.
• “Keeners” effect
Example: You have to have quota
Quotas
                 Kids 13-18 @ home
Mom's Age        Yes        No         ?
24-39            n=100      n=100
40-54            n=100      n=100
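What is a Router Sample?
[Flow diagram: respondents enter an origin study (Study 1, Study 2, …, Study K)
and either qualify (Y) and complete it or are disqualified (N, DQ). Disqualified
respondents are asked “Want to be Routed?”: those answering N leave as DQ, while
those answering Y are passed to router studies (Router Study 1, Router Study 2,
…, Router Study #P).]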
Margin of Error Statement
With panels, the panel is the population and the frame. We can draw
probability samples from the panel and make inferences from the sample
to the panel only.
From June 18th to June 19th 2014 an online survey was conducted among
1,510 randomly selected Canadian adults who are Angus Reid Forum
panelists. For comparison purposes, a probability sample of this size has a
margin of error of +/- 2.5%, 19 times out of 20.
Inference from a panel sample to the general population is still a step
removed. However, through the use of sample balancing and
weighting, we can create a sample that matches general
population demographic characteristics as closely as possible.
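The +/- 2.5% figure in the statement above can be checked with the usual formula for a proportion from a simple random sample, z * sqrt(p(1 - p) / n). The sketch below assumes the most conservative case p = 0.5 and z = 1.96 for "19 times out of 20" (95% confidence). The function name is illustrative.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the confidence interval for a proportion,
    assuming a simple random (probability) sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

moe = margin_of_error(1510)
print(f"+/- {moe * 100:.1f}%")  # +/- 2.5%
```

Since the margin shrinks with the square root of n, halving it requires roughly four times the sample size.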
Weighting
• Weighting is used when we run into issues in sampling
• The effect of weighting is cosmetic:
• Makes sample “look” good
• Does not fix structural problems.
It’s all about Proportions
Ideal Sample
Panel Sample
Weighted Sample
Weighting Efficiency
Weighted Sample: n=1,000, weighting efficiency=20%
Unweighted Ideal Sample: n=200, weighting efficiency=100%
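A common way to compute weighting efficiency is Kish's effective sample size, (sum of weights)^2 / (sum of squared weights), divided by the actual sample size. The slides do not state which formula is used, so treat this as an assumption. The weights below are hypothetical, chosen only so that 1,000 weighted respondents carry the information of 200 unweighted ones.

```python
def effective_sample_size(weights):
    """Kish's approximation: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq

def weighting_efficiency(weights):
    """Effective sample size as a share of the actual sample size."""
    return effective_sample_size(weights) / len(weights)

# Hypothetical weights: 100 respondents weighted up heavily, 900 left at 1.
weights = [21] * 100 + [1] * 900

print(effective_sample_size(weights))  # 200.0
print(weighting_efficiency(weights))   # 0.2
```

When all weights are equal the efficiency is 100%. The more the weights vary, the lower it falls.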
Two Principles
1. Use good information
2. Weight with a light hand
Good Information
Relevant, Credible, Independent
• Information matches what’s in the sample
• Trusted Sources – in order of preference:
1. Census
2. Large scale studies from national agencies
3. Industry (databases, publications, etc…)
4. Validated Historical data
5. When all else fails, VC Omni studies – our best effort at a GenPop
sample.
• National Panels Mosaic study is not a good source. At a minimum, run Omni to
validate.
Weight with a light hand
• Only weight where you need to
• Take the “goodness” of source into account
• Use as few variables as you can
• Use the broadest classification that meets your needs
• E.g. 5 regions vs. 10 provinces
• Use RIM weighting if possible
• RIM weighting adjusts only the marginal distributions, so it generally
distorts the data less and gives better weighting efficiency than
weighting on every cell.
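RIM weighting is commonly implemented as raking (iterative proportional fitting): weights are scaled one variable at a time until every margin matches its target. The sketch below is a minimal pure-Python version under that assumption. It is not the routine any particular tabulation package uses, and the respondents, targets and function name `rim_weights` are hypothetical.

```python
def rim_weights(respondents, targets, max_iter=100, tol=1e-6):
    """Rake weights so each variable's weighted shares match its targets.

    respondents : list of dicts, e.g. {"gender": "F", "region": "East"}
    targets     : dict variable -> dict category -> target proportion
    """
    weights = [1.0] * len(respondents)
    for _ in range(max_iter):
        max_change = 0.0
        for var, shares in targets.items():
            # Current weighted total in each category of this variable.
            totals = {cat: 0.0 for cat in shares}
            for r, w in zip(respondents, weights):
                totals[r[var]] += w
            grand = sum(totals.values())
            # Scale weights so this variable's margin hits its target.
            for i, r in enumerate(respondents):
                factor = shares[r[var]] * grand / totals[r[var]]
                max_change = max(max_change, abs(factor - 1.0))
                weights[i] *= factor
        if max_change < tol:
            break  # every margin already matched within tolerance
    return weights

# Hypothetical sample of 100: gender is on target, region is not.
respondents = (
    [{"gender": "F", "region": "East"}] * 30
    + [{"gender": "F", "region": "West"}] * 20
    + [{"gender": "M", "region": "East"}] * 10
    + [{"gender": "M", "region": "West"}] * 40
)
targets = {
    "gender": {"F": 0.5, "M": 0.5},
    "region": {"East": 0.5, "West": 0.5},
}
weights = rim_weights(respondents, targets)

def weighted_share(var, cat):
    hit = sum(w for r, w in zip(respondents, weights) if r[var] == cat)
    return hit / sum(weights)

print(round(weighted_share("gender", "F"), 3), round(weighted_share("region", "East"), 3))
```

Only the two margins are constrained. The four gender-by-region cells are left to fall where they may, which is why the weights tend to stay closer to 1 than with cell-by-cell weighting.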
Statistical Significance Testing
Statistics: Significance Testing
• A form of Hypothesis Testing:
• Need to form a hypothesis before you can test
• E.g. Consumers are more likely to purchase Concept A than our current
product.
• Concept A has a higher PI (T2B) than Control Concept
• Can only be applied to a probability sample
t-Test
• Applied to any sample
• We can only test for differences in our sample, not the population
• Applied to all data tables: proportion, mean
• Applied to any 2 (non-overlapping) subgroups.
• High risk of data fishing
Data Fishing
http://xkcd.com/
t-Test
BDC Client: Are you a BDC client?

                                    Total            Atlantic(A)   Quebec(B)        Ontario(C)       Prairies(D)      BC(E)
Base                                732 / 595 / 732  51 / 50 / 58  156 / 151 / 190  257 / 208 / 248  152 / 101 / 124  116 / 95 / 112
                                    100.0%           100.0%        100.0%           100.0%           100.0%           100.0%
Yes                                 294 / 266 / 335  33            79               82               62               38
                                    40.2%            63.9% CE      50.5% C          32.0%            40.9%            33.0%
No, but I used to be                64 / 61 / 76     2             15               17               15               15
                                    8.7%             4.3%          9.9%             6.5%             9.7%             12.8%
No, I have never been a BDC client  374 / 275 / 321  16            62               158              75               63
                                    51.1%            31.8%         39.6%            61.5% AB         49.5%            54.2%
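As a rough check on one flagged cell, the sketch below compares the "Yes" share in Quebec (B) with Ontario (C), taking the counts 79 of 156 and 82 of 257 at face value. It uses the large-sample normal approximation for a difference in proportions, which is close to a t-test at these base sizes. It ignores weighting and is not necessarily the exact test the tabulation software runs.

```python
import math

def two_proportion_test(x1, n1, x2, n2):
    """Pooled two-sample test for a difference in proportions.
    Returns the z statistic and the two-sided p-value."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail area of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# "Yes" counts and bases for Quebec (B) and Ontario (C) from the table.
z, p = two_proportion_test(79, 156, 82, 257)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The small p-value is consistent with the "C" flag next to the Quebec percentage in the table.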
Bonferroni Correction
• Available on Quick Report data tables (by default)
• Apply a more stringent criterion for declaring significance
• Family-wise Confidence level
• Helps reduce the risk of data fishing
• Report all significant differences flagged
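To see why the correction matters, consider one row of a table with five regional columns (A to E), where every pair of columns is tested. The sketch below counts the comparisons, estimates the chance of at least one false positive at the 5% level, and shows the Bonferroni-adjusted threshold. The family-wise figure treats the tests as independent, which is a simplifying assumption since pairwise tests share columns.

```python
from itertools import combinations

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level that keeps the family-wise level at alpha."""
    return alpha / n_tests

def familywise_error(alpha, n_tests):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

columns = ["A", "B", "C", "D", "E"]
pairs = list(combinations(columns, 2))
n_tests = len(pairs)

print(n_tests)                                    # 10
print(round(familywise_error(0.05, n_tests), 3))  # 0.401
print(bonferroni_alpha(0.05, n_tests))
```

With ten uncorrected tests per row, a spurious flag somewhere in the row is close to a coin flip. Dividing the threshold by the number of tests brings the row-level risk back to about 5%.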
Wincross Data Tables
• No Bonferroni correction
• t-Test options for non-independent samples
• Useful for sequential monadic concept testing
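In a sequential monadic test the same respondents rate each concept, so the two sets of scores are not independent. A minimal sketch of the standard paired (dependent-samples) t statistic follows, assuming that is what the non-independent option corresponds to. The ratings are hypothetical 5-point purchase intent scores.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """t statistic and degrees of freedom for paired observations,
    computed on the per-respondent differences."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both concepts must be rated by the same respondents")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical ratings from the same 8 respondents for two concepts.
concept_a = [4, 5, 3, 4, 5, 4, 3, 5]
concept_b = [3, 4, 3, 3, 4, 4, 2, 4]

t, df = paired_t_statistic(concept_a, concept_b)
print(f"t = {t:.2f}, df = {df}")  # t = 4.58, df = 7
```

Working on the differences removes respondent-to-respondent variation, so a paired test can detect a difference that an independent-samples test on the same data would miss.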