A. Diversity Value method

Sampling for Variety
Matti Miestamo
Dik Bakker
Antti Arppe
Overview
1. Introduction: Probability vs Variety Sampling
2. Looking for Variety
3. The Experiment: DV versus GM
4. Conclusions
1. Probability vs Variety Sampling
Linguistic data analysis
Analysis
Report
All languages of the world: the only guarantee that no existing possibility is
missed (cf. extinct & future languages)
- Impossible: gaps in description, time considerations, etc.
Tendencies & Correlations
Analysis
All languages of the world
- Impossible
- Not desired
Exploring Existing Variety
Analysis
All languages of the world
- Impossible
- Not desired
- Not necessary
2. Looking for Variety
Variety Sampling
Some methods that could be applied for finding variety:
1. Tomlin (1986)
2. Dryer (1989)
3. Nichols (1992)
4. Bybee, Perkins & Pagliuca (1994)
…
Some methods designed for finding variety:
A. Rijkhoff & al. (1993); Rijkhoff & Bakker (1998) – DV method
- All independent families represented; extra number of languages
taken from each grouping determined by its Diversity Value
- No areal stratification
- Problem: dependent on the details of classifications, but these
are often uncertain and incommensurable across the world
B. Miestamo (2003, 2005) – GM method
- All Genera (Dryer 1989) are represented
- Areal stratification > Macro-areas (Dryer 1989)
Rijkhoff & al. (1993); Rijkhoff & Bakker (1998) – DV method
Miestamo (2003, 2005) – GM method
Comparison DV versus GM method:
- specifically designed for variety sampling
- most explicit general methods available
- implemented computationally
- replicable
A. Diversity Value method
Basis: Tree-shaped genealogical classification (e.g. Ethnologue-16,
Ethnologue-15, WALS, Ruhlen 1991, Voegelin & Voegelin, …)
Basic Sample (BS): one language per family (= highest node)
Minimum sample size:
- Ethn-15: 147
- WALS: 209
- Ruhlen: 22
Small Sample (< minimum): Random (1 per family)
Extended Sample (> minimum): 1 + DV value
DV value: weight of a family tree based on recursively
calculated complexity of all subtrees
[Figure: DV computed step by step for two example trees, Fam_i and Fam_k (intermediate values shown: 9, 6, 4.5, 3.0); the build ends with DV=7.5 for Fam_i > DV=6 for Fam_k, etcetera …]
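One possible reading of "1 + DV value" is that every family first receives one language and the remaining slots are shared out in proportion to each family's DV. The sketch below illustrates that reading only. The function name `allocate_extended_sample` and the largest-remainder rounding are assumptions made for illustration, not taken from the slides or from Rijkhoff & Bakker (1998). The DV values 7.5 and 6 are the ones shown for Fam_i and Fam_k.

```python
from math import floor


def allocate_extended_sample(dv_by_family, sample_size):
    """Give each family 1 language, then share the rest in proportion to DV.

    Rounding uses the largest-remainder rule (an assumption of this sketch).
    """
    families = list(dv_by_family)
    extra = sample_size - len(families)
    if extra < 0:
        raise ValueError("sample_size is below the minimum (one per family)")
    total_dv = sum(dv_by_family.values())
    # Ideal (fractional) share of the extra slots for each family.
    shares = {f: extra * dv_by_family[f] / total_dv for f in families}
    quota = {f: 1 + floor(shares[f]) for f in families}
    # Hand out the slots lost to rounding, largest fractional part first.
    leftover = sample_size - sum(quota.values())
    by_remainder = sorted(
        families, key=lambda f: shares[f] - floor(shares[f]), reverse=True
    )
    for f in by_remainder[:leftover]:
        quota[f] += 1
    return quota


quota = allocate_extended_sample({"Fam_i": 7.5, "Fam_k": 6.0}, sample_size=11)
print(quota)
```

With a sample of 11 languages, the 9 extra slots split 5 to 4, so the family with the higher DV ends up with one more language.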
B. Genus Macro-area method
Basis: Genealogical > Genera (Dryer 1989, 2005, 2008)
- Time depth comparable
- Relatively uncontroversial
- Relatively large size
Genus Sample (GS): 1 language per Genus (n=475)
How to establish a GM sample of a specific size?
Basic Sample (BS), n < 475 (RANDOM MOD FAMILY):
- Africa: 7 Families, 70 > 15, deleted iteratively from largest family
Basic Sample (BS), n > 475 (RANDOM MOD FAMILY):
- Iteratively assigned to proportionally least represented family
with respect to genera
…
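The n < 475 case ("deleted iteratively from largest family") can be sketched as a loop that repeatedly removes one genus from whichever family currently has the most genera in the sample. This is a minimal sketch: the function name `shrink_genus_sample`, the family labels F1 to F7 and the per-family genus counts are hypothetical. Only the totals (7 families, 70 genera reduced to 15) follow the Africa example. Ties are broken by dictionary order here, whereas the slides indicate a random choice.

```python
def shrink_genus_sample(genera_per_family, target):
    """Reduce a macro-area's genus count to `target`, one genus at a time."""
    counts = dict(genera_per_family)
    if not 0 <= target <= sum(counts.values()):
        raise ValueError("target must lie between 0 and the current total")
    while sum(counts.values()) > target:
        # Pick the family that currently contributes the most genera.
        largest = max(counts, key=counts.get)
        counts[largest] -= 1
    return counts


# Hypothetical distribution of 70 genera over 7 families.
africa = {"F1": 40, "F2": 12, "F3": 8, "F4": 5, "F5": 3, "F6": 1, "F7": 1}
reduced = shrink_genus_sample(africa, target=15)
print(reduced, sum(reduced.values()))
```

Because the deletion always hits the largest family, small families keep their genera until the large ones have been levelled down to their size.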
Variety Sampling: which method is the best?
3. The Experiment: DV versus GM
Procedure:
1. Generate samples of size S1, S2, S3, … for both DV and GM
2. Compare variety for each sample Sn for DV and GM
3. Determine which of the two has the highest average variety
Points:
a. Which classification(s) to use?
b. Which typological variables to measure variety on?
c. How to test?
a. Which classification(s)?
DV: must have ‘depth’: Ethnologue (#15); n of lgs = 7299
GM: must have genera: WALS; n of lgs = 2561
b. Which typological variables?
WALS database:
- 138 variables (phon/morph/synt/lex/…): representative
- 2–9 different values per variable
- value distribution ranges from frequent to rare
- 2511 languages in total with 1 or more values
- on average, a value for 415 languages per variable
c. How to test?
1. For each variable Vi (i = 1–138) in the WALS database: (138 x)
2. For a series of 18 sample sizes Sn (n = 50, 100, 150, …, 900): (18 x)
3. For each method M (DV, GM):
4. Generate sample Vi.Sn on the basis of M-specific classification C
5. This gives n nodes (family, group, genus) in C
6. Randomly select a language for each node from C
7. For each language determine value (1–9) for Vi in WALS, or 0 if it has none
8. Add value to value set Vi.Sn
9. Compare completeness Vi.Sn for DV and GM
100 x (draws per variable and sample size)
= 250,200 samples per method
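The number of samples per method follows from multiplying the three loop counts. The short check below works through that arithmetic. The helper name `samples_per_method` is hypothetical. The total of 250,200 given above corresponds to 139 variables (the figure used in the results slides), while 138 variables would give 248,400.

```python
# Sample sizes 50, 100, ..., 900 as listed in step 2.
sample_sizes = list(range(50, 901, 50))
n_draws = 100  # draws per variable and sample size


def samples_per_method(n_variables):
    """Total samples generated for one method (DV or GM)."""
    return n_variables * len(sample_sizes) * n_draws


print(len(sample_sizes))        # number of sample sizes
print(samples_per_method(138))  # total for 138 variables
print(samples_per_method(139))  # total for 139 variables
```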
Key factors (0.0 = LOW, 1.0 = HIGH)
Saturation (SAT): proportion of values for variable in a sample
0.0: no values found
1.0: all values (2–9) found
Completeness (COMP): number of draws necessary to find all values
0.0: maximum (= 100 draws) reached
1.0: all values found in first draw
Several more, not discussed here …
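The two measures can be sketched as follows. Saturation is a direct reading of its definition. For completeness, the slides fix only the two end points (first draw = 1.0, 100 draws = 0.0), so the linear scale in between is an assumption of this sketch, as are the function names.

```python
def saturation(sample_values, all_values):
    """Proportion of a variable's possible values that occur in one sample."""
    # Value 0 (language has no value for the variable) never counts as found.
    found = set(sample_values) & set(all_values)
    return len(found) / len(all_values)


def completeness(draws_needed, max_draws=100):
    """Map the number of draws needed to find all values onto 0.0-1.0.

    Assumed linear scale: 1 draw -> 1.0, max_draws -> 0.0.
    """
    if not 1 <= draws_needed <= max_draws:
        raise ValueError("draws_needed must be between 1 and max_draws")
    return (max_draws - draws_needed) / (max_draws - 1)


# A hypothetical variable with four possible values, sampled in six languages.
print(saturation([1, 0, 2, 2, 0, 1], all_values=[1, 2, 3, 4]))
print(completeness(1), completeness(100))
```

In the example, two of the four possible values turn up in the sample, so saturation is one half.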
MEAN over FEATURES (139) * SAMPLE SIZES (50, 100, … , 900)
N of draws per Feature * SampleSize: 100
[Figure: DV and GM compared] GM > DV
5028 languages with no value!
[Figure: DV and GM compared, only languages with a value] DV > GM
DV slightly better overall than GM, but:
A. Differences per sample size?
B. Differences per feature?
A. Sample size:
1. SATuration [Figure; annotations: DV ≈ GM, DV > GM]
2. COMPleteness [Figure; annotations: GM > DV, DV > GM]
GM improves around sample size 450–500
GM: number of genera = 475
> support for genera?
B. Feature:
[Figure: differences per feature] Factor?
How GOOD are DV and GM?
Compare with RANDOM sample of same size:
DV vs GM vs RANDOM
MEAN over FEATURES (138) * SAMPLE SIZES (50, 100, … , 900)
N of draws per Feature * SampleSize: 100
[Figure] DV & GM > RAN
MEAN over FEATURES (139) * SAMPLE SIZES (50, 100, … , 900)
N of draws per Feature * SampleSize: 100
[Figure] DV & GM > RAN
4. Conclusions
1. As an explorative device, Variety Sampling works
better at any common sample size (50-900) than
Random Sampling:
a. to find maximum variety
b. to establish it relatively easily
(c. all other measures …)
2. A purely genealogically based method (DV) works
slightly better for smaller samples (< 500) than a
method which combines a genealogical basis with
areal stratification (GM).
For larger samples both methods are equally good,
but areal stratification may make it easier to find
the optimal sample.
3. Unclear why areal stratification does not simply
improve genealogical sampling
- Macro-Areas too crude > AutoTyp areas?
- Areal under/overrepresentation of Genera
- Try other diversity criteria (e.g. Dahl 2008)
GOAL: find optimal balance Areal vs Genealogical
stratification