Popularity of Each Roll - Dordt College Homepages

Choose n out of these k objects
For example:
• Choose your three favorites out of these ten photographs
• Of these fifty apps, which ten would you download to your phone?
• Which two of these seven movies would you want to watch?
Could we prove that there is dependence within each person’s choices? For example, do people have a certain “taste” in sushi rolls?
Objectives:
• We wanted to prove that each person does not choose randomly. Some items are chosen together more often than they would be otherwise.
• In particular, we wanted to find which items are similar to one another. If a person chooses a given object, which other objects is he also more likely to choose?



SUSHI Preference Data Set -- survey taken by
5000 people in which they were asked to rank
ten different types of rolls from best to worst
(http://www.kamishima.net/sushi/)
The ten rolls: shrimp (0), sea eel (1), tuna (2),
squid (3), sea urchin (4), salmon (5), egg (6),
fatty tuna (7), tuna roll (8), cucumber (9)
We just looked at each respondent’s first three
choices and ignored the order in which they
listed them. (This way, the data fit our “choose
n out of k” format.)
The following is a matrix of how often each pair of rolls appeared together in someone’s top three:

Roll     0     1     2     3     4     5     6     7     8     9
0        x
1      394     x
2      451   411     x
3      369   223   299     x
4      398   519   373   193     x
5      395   387   412   191   806     x
6      227   238   144   114    62   140     x
7      774   925  1412   421  1296  1148   253     x
8      129   156   404   122    88   134    74   468     x
9       83    43    42    52    21    47    66    61    35     x

Most popular pairs: (2,7) with 1412 and (4,7) with 1296. Least popular pair: (4,9) with 21.
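The counting step behind a matrix like this can be sketched as follows. This is a minimal illustration in Python (the original analysis used R), and the function name `pair_counts` and the three toy respondents are made up for the example; they are not survey data.

```python
from itertools import combinations

def pair_counts(top_choices, n_items):
    """Count how often each unordered pair of items is chosen together."""
    counts = [[0] * n_items for _ in range(n_items)]
    for chosen in top_choices:
        # Order within a respondent's list is ignored, as in the text.
        for a, b in combinations(sorted(set(chosen)), 2):
            counts[b][a] += 1  # lower triangle: row index > column index
    return counts

# Three made-up respondents, each listing a top three.
toy = [(2, 7, 4), (7, 2, 8), (4, 5, 7)]
counts = pair_counts(toy, n_items=10)
print(counts[7][2])  # how often rolls 2 and 7 appear together
```

Each respondent contributes three pairs, so a survey of 5000 respondents yields 15000 pair counts in total, which is what the entries of the matrix above add up to.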
Doesn’t this answer our questions?
• The most popular pairings were (2,7) and (4,7). So those who like roll #7 were more likely to choose roll #2 or #4.
• The least popular pairing was (4,9): only 21 respondents listed them as two of their top three! They must be very dissimilar.
That ignores the fact that some rolls were just more popular overall. It makes sense that (2,7) and (4,7) were chosen together so often since 2, 4, and 7 were popular overall. The reverse is true for pairs involving roll 9, the least popular roll.
There’s no clear proof that these pairings tell us anything about people’s taste; they may just reflect each roll’s popularity.
[Figure: bar chart “Popularity of Each Roll”. x-axis: Roll (0 to 9); y-axis: # of Votes (0 to 4000).]
We needed to generate a matrix of how
often each pair of rolls would be expected
to appear together. We could then
compare the actual results to the
expected results.
To generate this matrix, we decided to run
a simulation.
• Each respondent needs to randomly choose three rolls
• The rolls must be chosen without replacement: each respondent needs to choose three different rolls
• Each roll’s overall popularity must be held fixed


Simply choose three rolls out of ten without replacement, using sample(0:9, 3, replace=FALSE, prob=c(P1, P2, …)) in R.

Imagine that a number line between 0 and 3 is split up into 10
parts where the size of each part is proportional to the
frequency of each subsequent roll.
A random number between 0 and 3 is then generated,
corresponding to one of the rolls. For example, if 1.4 was
generated, then roll #4 would be chosen.


A new number line is then drawn, leaving out whichever roll
was chosen the first time, while proportionally increasing the
size of each remaining part. For example, this would be the
new number line if #4 were chosen:

Once again, a number between 0 and 3 would be chosen,
corresponding to the second roll chosen.
This same process would be repeated to choose the third roll.
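The redraw-the-number-line procedure can be sketched in a few lines. This is an illustrative Python version, not the authors' R code; `draw_sequential` is a hypothetical name, and the weights are the survey frequencies reported in this deck.

```python
import random

def draw_sequential(weights, n_choices, rng):
    """Pick n_choices distinct items, redrawing the number line after each pick."""
    remaining = dict(enumerate(weights))
    chosen = []
    for _ in range(n_choices):
        total = sum(remaining.values())
        u = rng.random() * total  # random point on the current number line
        running = 0.0
        for item, w in remaining.items():
            running += w
            if u < running:
                break
        # If rounding leaves u just past the end, `item` is the last segment.
        chosen.append(item)
        del remaining[item]  # leave this roll out of the next number line
    return chosen

# Survey frequencies for rolls 0..9.
freq = [0.107, 0.110, 0.132, 0.066, 0.125, 0.122, 0.044, 0.225, 0.054, 0.015]
picks = draw_sequential(freq, 3, random.Random(0))
print(picks)
```

The length of the number line (3 in the text, the sum of the remaining weights here) only rescales the random point, so it does not change which roll is selected.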



We have to redraw the number line after the first choice. As a
result, the probabilities for the second and third choices are
not the same as the overall probabilities.
The overall distribution of choices from the simulation is not
equal to the overall distribution of choices from the actual
survey:
Roll               0      1      2      3      4      5      6      7      8      9
Actual Frequency   0.107  0.110  0.132  0.066  0.125  0.122  0.044  0.225  0.054  0.015
Simulated          0.111  0.113  0.132  0.072  0.126  0.124  0.049  0.195  0.060  0.017
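The mismatch can also be checked without simulation noise by enumerating every ordered triple of rolls. The sketch below is a Python illustration under the assumption that the first technique draws each roll with probability proportional to its survey frequency among the rolls not yet chosen; `sequential_shares` is a hypothetical helper name.

```python
from itertools import permutations

def sequential_shares(p, n_choices=3):
    """Exact share of all picks going to each item under sequential drawing."""
    inclusion = [0.0] * len(p)
    for ordered in permutations(range(len(p)), n_choices):
        prob, left = 1.0, 1.0
        for item in ordered:
            prob *= p[item] / left  # chance of this pick given earlier picks
            left -= p[item]         # the chosen roll leaves the number line
        for item in ordered:
            inclusion[item] += prob
    # Each respondent makes n_choices picks, so divide to get shares.
    return [x / n_choices for x in inclusion]

freq = [0.107, 0.110, 0.132, 0.066, 0.125, 0.122, 0.044, 0.225, 0.054, 0.015]
shares = sequential_shares(freq)
print([round(s, 3) for s in shares])
```

Under this assumption the most popular roll (#7) ends up with a share below its survey frequency of 0.225, and the least popular roll (#9) with a share above 0.015, which is the same direction of distortion as in the simulated row.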
How can we fix this? We somehow need to keep
the overall probabilities constant for each choice,
while still not allowing for repeats.
Hartley and Rao (1962) describe an approach to solve this problem:
1. Randomize the order of the rolls. This was accomplished by calling sample(0:9) in R.
2. Split up the number line between 0 and 3 into 10 parts where the size of each part is still proportional to the frequency of the corresponding roll, but using the new order. For example, when the new order of the rolls is [3,7,5,9,1,2,4,0,8,6], the parts are laid out along the number line in that order.
3. A random number between 0 and 1, d, is chosen.
4. The three rolls selected are the ones corresponding to d, d+1, and d+2.
In this example, take d = .95, meaning that rolls 5, 2, and 6 (the rolls corresponding to .95, 1.95, and 2.95) are chosen.
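The four steps can be sketched as below. This is an illustrative Python version rather than the authors' R code, and the function names are hypothetical. It relies on every part being shorter than 1 (the longest here is 3 × 0.225 = 0.675), so the points d, d+1, and d+2 can never land on the same roll.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def systematic_pick(order, freq, d, n_choices=3):
    """Select the rolls whose parts contain d, d+1, ..., d+n_choices-1."""
    # Each roll's part has length n_choices * freq[roll], laid out in `order`.
    edges = list(accumulate(n_choices * freq[r] for r in order))
    picks = []
    for step in range(n_choices):
        idx = bisect_right(edges, d + step)  # index of the part holding the point
        picks.append(order[min(idx, len(order) - 1)])  # guard against rounding
    return picks

def hartley_rao_draw(freq, rng, n_choices=3):
    order = list(range(len(freq)))
    rng.shuffle(order)  # step 1: randomize the order of the rolls
    d = rng.random()    # step 3: random start point between 0 and 1
    return systematic_pick(order, freq, d, n_choices)

freq = [0.107, 0.110, 0.132, 0.066, 0.125, 0.122, 0.044, 0.225, 0.054, 0.015]
print(systematic_pick([3, 7, 5, 9, 1, 2, 4, 0, 8, 6], freq, 0.95))  # [5, 2, 6]
print(hartley_rao_draw(freq, random.Random(1)))
```

Because d is uniform on [0, 1), exactly one of the points d, d+1, d+2 falls in any given stretch of the line with probability equal to that stretch's length, so roll i is included with probability 3 × (its frequency), which is the property the sequential technique lacked.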
Our simulation shows that each roll is
chosen with the same frequency using this
technique as in the actual survey.
Roll               0      1      2      3      4      5      6      7      8      9
Actual Frequency   0.107  0.110  0.132  0.066  0.125  0.122  0.044  0.225  0.054  0.015
Technique #2       0.107  0.110  0.132  0.066  0.125  0.122  0.044  0.225  0.054  0.015
Technique #1       0.111  0.113  0.132  0.072  0.126  0.124  0.049  0.195  0.060  0.017
Using this second method, we found our matrix of expected
results. The fact that our expectations were so different from the
actual data implies that people don’t make their choices
independently.
Roll        0        1        2        3        4        5        6        7        8        9
0           x
1      387.57        x
2      470.63    481.2        x
3       221.9   227.63   262.73        x
4       441.5   455.07   552.37   249.87        x
5      432.27    441.6   529.13   245.13   507.77        x
6      145.63   144.43   176.53     81.5    167.7   163.43        x
7       911.9   935.73   1196.3   550.13   1128.4   1080.9      367        x
8      177.03    180.2   214.43    116.4    208.6   200.97   48.367   456.23        x
9        49.1   50.233   51.733   23.767   52.333     51.2   18.667   124.47     20.3        x
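One way such an expected matrix could be produced is to simulate whole surveys with the second technique and average the pair counts. The Python sketch below is an assumption-laden illustration: the function names, the number of simulated surveys (`n_runs`), and the seed are all hypothetical, since the deck does not state how many runs were averaged.

```python
import random
from bisect import bisect_right
from itertools import accumulate, combinations

def simulate_survey(freq, n_respondents, rng, n_choices=3):
    """Pair counts for one simulated survey (randomized systematic draw)."""
    k = len(freq)
    counts = [[0] * k for _ in range(k)]
    for _ in range(n_respondents):
        order = rng.sample(range(k), k)  # random roll order
        edges = list(accumulate(n_choices * freq[r] for r in order))
        d = rng.random()                 # start point in [0, 1)
        picks = {order[min(bisect_right(edges, d + s), k - 1)]
                 for s in range(n_choices)}
        for a, b in combinations(sorted(picks), 2):
            counts[b][a] += 1            # lower triangle only
    return counts

def expected_pair_counts(freq, n_respondents, n_runs, seed=0):
    """Average the pair counts over n_runs simulated surveys."""
    rng = random.Random(seed)
    k = len(freq)
    total = [[0] * k for _ in range(k)]
    for _ in range(n_runs):
        run = simulate_survey(freq, n_respondents, rng)
        for i in range(k):
            for j in range(k):
                total[i][j] += run[i][j]
    return [[cell / n_runs for cell in row] for row in total]

freq = [0.107, 0.110, 0.132, 0.066, 0.125, 0.122, 0.044, 0.225, 0.054, 0.015]
# Small sizes keep the example fast; the survey itself had 5000 respondents.
expected = expected_pair_counts(freq, n_respondents=500, n_runs=4)
print(expected[7][2])
```

Averaging over more runs would smooth the estimates; the small values used here are only meant to show the shape of the computation.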
We generated the residual matrix using the formula
$$\text{residual} = \frac{\text{actual} - \text{expected}}{\sqrt{\text{expected}}}.$$
• The residuals serve as measurements of similarity. A large positive residual means that the two rolls are similar and were chosen together more often than would have been expected.
• The opposite is true for a large negative residual.
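As a worked check, the residual for a single pair can be computed directly. The sketch assumes the denominator is the square root of the expected count; that reading reproduces the tabulated residuals from the actual and expected counts reported in this deck. The function name `pearson_residual` is chosen for illustration.

```python
import math

def pearson_residual(actual, expected):
    """(actual - expected) scaled by the square root of the expected count."""
    return (actual - expected) / math.sqrt(expected)

# Pair (0,1): chosen together 394 times, expected 387.57.
print(round(pearson_residual(394, 387.57), 3))  # 0.327
# Pair (6,9): chosen together 66 times, expected 18.667.
print(round(pearson_residual(66, 18.667), 2))   # 10.96
```

Dividing by the square root of the expected count puts rare and common pairs on a comparable scale, which is why a pair chosen only 66 times can have a larger residual than one chosen more than a thousand times.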

Roll        0        1        2        3        4        5        6        7        8        9
0           x
1       0.327        x
2      -0.905     -3.2        x
3       9.875   -0.307    2.237        x
4       -2.07    2.997   -7.632   -3.598        x
5      -1.792   -2.598   -5.092   -3.458   13.235        x
6       6.742    7.786   -2.449      3.6   -8.162   -1.833        x
7      -4.567   -0.351    6.236   -5.506    4.989    2.041   -5.951        x
8       -3.61   -1.803    12.95    0.519    -8.35   -4.724    3.686    0.551        x
9       4.838   -1.021   -1.353    5.791   -4.331   -0.587    10.96   -5.689     3.26        x
*Remember how 2 and 7 initially seemed to be the most similar pair? They still look similar (residual 6.236), but several other pairings have much larger residuals. For example, 6 and 9 were chosen together only 66 times, yet that pair has a residual of 10.96!




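To convert the residual matrix into a distance matrix, we needed to make all the values positive. We did this by setting distance equal to (15 − residual). Since the largest residual is 13.235, every distance is positive, and the most similar pairs get the smallest distances.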
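The conversion is a one-line transformation; the Python sketch below adds a guard that the offset really exceeds every residual. The function name and the dictionary layout are illustrative choices, and the three residuals are taken from the residual matrix in this deck.

```python
def residuals_to_distances(residuals, offset=15.0):
    """Turn residuals (keyed by roll pair) into positive distances."""
    distances = {pair: offset - r for pair, r in residuals.items()}
    if min(distances.values()) <= 0:
        raise ValueError("offset must exceed the largest residual")
    return distances

sample = {(4, 5): 13.235, (6, 9): 10.96, (4, 8): -8.35}
dist = residuals_to_distances(sample)
print(dist)
```

The offset of 15 is the value used in the text; any constant larger than the biggest residual would keep the distances positive while preserving their ordering.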
To visualize this matrix, we ran multidimensional
scaling (MDS).
MDS attempts to set a point for each roll such that
the distance between any two points is proportional
to the distance between the corresponding rolls.
These points are then plotted on an (x,y) axis so the
results can be seen more easily.
Essentially, the n objects are first plotted in (n-1)-dimensional space so that the distances between all points are perfect. This is then “scaled down” to two dimensions.
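The deck does not say which MDS variant was used. As one possibility, classical (Torgerson) MDS can be sketched in plain Python: square the distances, double-centre them, and use the two leading eigenvectors as coordinates. The power-iteration eigen-solver, the function name, and the rectangle test points below are all illustrative assumptions, not the authors' code or data.

```python
import math
import random

def classical_mds(dist, dims=2, iters=500, seed=0):
    """Embed points so Euclidean distances approximate a symmetric dist matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in sq]
    grand_mean = sum(row_mean) / n
    # Double-centred matrix B = -1/2 * J * D^2 * J.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for axis in range(dims):
        # Power iteration: converges to the eigenvector whose eigenvalue
        # has the largest absolute value.
        v = [rng.random() - 0.5 for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * bv[i] for i in range(n))  # eigenvalue estimate
        scale = math.sqrt(max(lam, 0.0))           # negative eigenvalues give 0
        for i in range(n):
            coords[i][axis] = scale * v[i]
        # Deflate: remove the component just found before the next axis.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords

# Made-up test points (corners of a 4-by-3 rectangle), not survey data.
pts = [(0, 0), (4, 0), (4, 3), (0, 3)]
dist = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in pts] for p in pts]
emb = classical_mds(dist)
print(emb)
```

For distances that come from a genuinely flat configuration, as in the rectangle, the embedding reproduces them up to rotation and reflection. For distances such as 15 − residual, which need not be Euclidean, the two-dimensional plot is only an approximation.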
[Figure: two-dimensional MDS plot of the ten rolls. Legend: 0 - shrimp, 1 - sea eel, 2 - tuna, 3 - squid, 4 - sea urchin, 5 - salmon, 6 - egg, 7 - fatty tuna, 8 - tuna roll, 9 - cucumber]
To further support these results, we re-ran the analysis by looking at each respondent’s top five choices. These were the results of the new multidimensional scaling:
The fact that this plot is so similar to our prior one (see previous slide) suggests that our results were not merely an artifact of our arbitrary decision to look at the top three choices, and that any values of n and k (where n < k) should work.
The groupings made by the MDS make sense when we look back
at what each type of roll was.
Look at the clusters it formed:
• 6 and 9 → Egg and Cucumber, the two non-fish choices
• 2, 7, and 8 → All three are different types of tuna rolls
Since those clusters make sense on their own, and were confirmed by our statistical analysis, we could also trust the other clusters we formed:
• 4 and 5 → Sea Urchin and Salmon
• 0, 1, and 3 → Shrimp, Sea Eel, Squid





In our study, we looked at associations in
choice data using simulations.
The simulation was done by sampling
without replacement yet still proportional to
size.
We showed that people did not make their
choices randomly.
MDS and clustering based on the identified
associations revealed the specifics of
people’s taste.
This general approach can be readily
applied to other choice data.