Analysis of ClickDiary Data: Some Initial Results

Tso-Jung Yen
Institute of Statistical Science
Academia Sinica
Joint work with Ta-Chien Chan, Yang-chih Fu, and Jing-Shiang Hwang
July 7, 2015
Outline
• Introduction
  • Definition of the egocentric network
  • Data collection
• Data analysis
  • Main hypothesis
  • Model building strategy
  • Results
• Discussion
Egocentric Networks
• What is an egocentric network?
  • One node, called the ego, is at the center, surrounded by other nodes called alters.
  • From a graph viewpoint: an egocentric network is a graph $G = (V, E)$ with
    \[ V = \{\{\text{ego}\}, \{\text{alter } 1, \text{alter } 2, \ldots, \text{alter } n\}\} \]
    being the node set and $E = \{E_{\text{ego,alter}}, E_{\text{alter,alter}}\}$ being the link set, where
    \[ E_{\text{ego,alter}} = \text{all links connecting alters to ego}, \]
    \[ E_{\text{alter,alter}} = \text{all links connecting alters to alters}. \]
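The graph definition above can be sketched in a few lines of Python. This is only an illustration: the function name `build_ego_network` and the node labels are hypothetical, and alter-alter links are treated as undirected, which the slides do not state explicitly.

```python
def build_ego_network(ego, alters, alter_links):
    """Return (V, E_ego_alter, E_alter_alter) for an egocentric network."""
    alters = set(alters)
    # By construction, every alter is linked to the ego.
    e_ego_alter = {(ego, a) for a in alters}
    # Keep only links between two distinct alters; frozenset makes them undirected.
    e_alter_alter = {
        frozenset((a, b))
        for a, b in alter_links
        if a in alters and b in alters and a != b
    }
    nodes = {ego} | alters
    return nodes, e_ego_alter, e_alter_alter


# Toy example with three alters; the link to "stranger" is ignored
# because "stranger" is not in the ego's personal network.
nodes, e_ea, e_aa = build_ego_network(
    "ego",
    ["alter1", "alter2", "alter3"],
    [("alter1", "alter2"), ("alter2", "alter1"), ("alter3", "stranger")],
)
```

Links that reach outside the ego's personal network are dropped here, which mirrors the limitation noted later in the Discussion.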
Egocentric Networks
Figure: Graphical representation of an egocentric network.
Data Collection
• Collecting egocentric network data:
  • Position generators (Lin and Dumin, 1986; Lin et al., 2001).
  • Name generators (Laumann, 1973; Wellman, 1979).
  • Contact diary (de Sola Pool and Kochen, 1978; Freeman and Thompson, 1989; Fu, 2007; Chan et al., 2015):
    • Collects egocentric network data via self-reporting.
    • Collects egocentric network data on a daily basis. Data are in longitudinal format.
Data Collection
• ClickDiary: An online platform for collecting egocentric network data using the contact diary method.
  • Health diary (personal health information on a daily basis):
    • (1) sleep, (2) emotion, (3) diet, (4) exercise, (5) flu symptoms, (6) number of contacted people and physical distance from home, (7) blood pressure, and (8) weight.
  • Contact diary (personal contact information on a daily basis):
    • (1) contact type, content, time, duration, and location, (2) active or passive, (3) felt beneficial or not, (4) emotion change, (5) health information (e.g. flu-like symptoms).
Data Collection
• Collected between May 1, 2014 and October 31, 2014 (184 days).
• Hierarchical data (3 levels):
  • # of egos: 130.
  • # of alters: 13,409.
  • # of contacts: 110,394.
• # of alters in each egocentric network: minimum is 3, maximum is 1,115, and mean is 103.15.
• # of contacted days: minimum is 1, maximum is 184, and mean is 8.233.
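One way to picture the three-level hierarchy (contacts nested in alters, alters nested in egos) is the sketch below. The class and field names are hypothetical and are not the ClickDiary schema; the toy records are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Contact:
    day: int                # day index within the study window (hypothetical field)
    very_beneficial: bool   # whether the ego rated the contact "very beneficial"


@dataclass
class Alter:
    alter_id: str
    contacts: list = field(default_factory=list)


@dataclass
class Ego:
    ego_id: str
    alters: dict = field(default_factory=dict)

    def add_contact(self, alter_id, contact):
        # Create the alter on first contact, then append the record.
        alter = self.alters.setdefault(alter_id, Alter(alter_id))
        alter.contacts.append(contact)


def summarize(egos):
    """Return (# of egos, # of alters, # of contacts)."""
    n_alters = sum(len(e.alters) for e in egos)
    n_contacts = sum(len(a.contacts) for e in egos for a in e.alters.values())
    return len(egos), n_alters, n_contacts


e1 = Ego("ego1")
e1.add_contact("a1", Contact(day=1, very_beneficial=True))
e1.add_contact("a1", Contact(day=2, very_beneficial=False))
e1.add_contact("a2", Contact(day=2, very_beneficial=False))
e2 = Ego("ego2")
e2.add_contact("a1", Contact(day=5, very_beneficial=True))
counts = summarize([e1, e2])
```

Alters are keyed within each ego, so the same label under two egos counts as two alters, in line with the nesting described above.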
Hypothesis
• Claim:
  • The quality of contact between the ego and an alter is associated with the alter’s network position in the ego’s personal network (egocentric network).
  • Theories that suggest this claim:
    • Theory of weak ties (Granovetter, 1973).
    • Theory of structural holes and embeddedness (Burt, 2001; 2004; 2009).
Model
• Quantify the weak tie:
  • Ego i has a weak tie to alter j if i is not familiar with j (this also includes alters whom ego i did not know previously).
Model
• Quantify embeddedness of an alter in an egocentric network:
• Normalized embeddedness score based on the strong ties:
  \[ \mathrm{NES}^{S}_{j} = \frac{\#\text{ of alters with whom alter } j \text{ is familiar}}{\#\text{ of alters} - 1}. \]
• Normalized embeddedness score based on the weak ties:
  \[ \mathrm{NES}^{W}_{j} = \frac{\#\text{ of alters whom alter } j \text{ knows but is unfamiliar with}}{\#\text{ of alters} - 1}. \]
• NES scores quantify the proportion of alters whom alter j and the ego know in common (a measure of triadic closure).
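The two scores can be computed as in the following sketch. It assumes the alter-alter ties of one egocentric network are given as a mapping from unordered pairs to the labels "familiar" or "unfamiliar" (know each other but are not familiar); this input format and the function name `nes_scores` are assumptions for illustration only.

```python
def nes_scores(alters, ties):
    """Return {alter: (NES_S, NES_W)} for one egocentric network.

    ties maps frozenset({a, b}) to "familiar" or "unfamiliar";
    pairs that are absent are taken as not knowing each other.
    """
    n = len(alters)
    if n < 2:
        raise ValueError("need at least two alters")
    scores = {}
    for j in alters:
        labels = [ties.get(frozenset((j, k))) for k in alters if k != j]
        strong = sum(1 for lab in labels if lab == "familiar")
        weak = sum(1 for lab in labels if lab == "unfamiliar")
        # Both scores share the denominator (# of alters - 1).
        scores[j] = (strong / (n - 1), weak / (n - 1))
    return scores


# Toy network with four alters.
ties = {
    frozenset(("a", "b")): "familiar",
    frozenset(("a", "c")): "unfamiliar",
    frozenset(("b", "c")): "familiar",
}
scores = nes_scores(["a", "b", "c", "d"], ties)
```

In this toy network alter d knows no other alter, so both of its scores are 0, while alter b is familiar with two of the three other alters.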
Model
• Quantify quality of contact:
  • To what extent did you (the ego) feel you benefited from contacting the alter?
    • (1) lost (0.6%); (2) almost none (35.7%); (3) somewhat beneficial (46.0%); (4) very beneficial (17.7%).
Model
• Quantify quality of contact (contd):
  • The dependent variable:
    \[ Y_{ijl} = I\{\text{ego } i \text{ felt very beneficial after contacting alter } j \text{ on record } l\}. \tag{1} \]
    Here l is an index for the contact record between ego i and alter j.
• Weak ties may play important roles in achieving great gains, e.g. job finding, but may not be that important in achieving small or moderate gains.
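As a minimal sketch, the dependent variable can be obtained by collapsing the four-level benefit rating into an indicator for the top category. The integer coding 1 to 4 follows the order in which the response options are listed; the function name is hypothetical.

```python
# Response options in the order listed on the slide (coding 1..4 is assumed).
BENEFIT_LEVELS = {
    1: "lost",
    2: "almost none",
    3: "somewhat beneficial",
    4: "very beneficial",
}


def to_indicator(level):
    """Return Y = 1 if the rating is 'very beneficial', else 0."""
    if level not in BENEFIT_LEVELS:
        raise ValueError(f"unknown benefit level: {level!r}")
    return 1 if level == 4 else 0


indicators = [to_indicator(level) for level in (1, 2, 3, 4)]
```

Dichotomizing at the top category matches the remark that weak ties may matter mainly for great gains rather than small or moderate ones.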
Model
• Assume
  \[ \mathrm{logit}[P(Y_{ijl} = 1)] = \beta_0 + \sum_{k=1}^{p} \alpha_k X_{ijlk} + \beta_1 \mathrm{WeakTie}_{ij} + \beta_2 \mathrm{NES}^{W}_{ij} + \beta_3 \mathrm{NES}^{S}_{ij} + \theta_i, \]
  where
  • $X_{ijlk}$’s are control variables.
  • $\mathrm{WeakTie}_{ij}$ is an indicator of whether alter j is weakly tied to ego i.
  • $\mathrm{NES}^{W}_{ij}$ is the normalized embeddedness score based on the weak ties.
  • $\mathrm{NES}^{S}_{ij}$ is the normalized embeddedness score based on the strong ties.
  • $\theta_i$ is the random intercept associated with ego i.
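To make the link function concrete, the sketch below evaluates the linear predictor of the model above and maps it to a probability with the inverse logit. It does not fit the model, and all coefficient values in the example are made up for illustration; none come from the ClickDiary results.

```python
import math


def predict_prob(beta0, alphas, x, beta1, weak_tie, beta2, nes_w, beta3, nes_s, theta_i):
    """P(Y_ijl = 1) under the random-intercept logistic model."""
    if len(alphas) != len(x):
        raise ValueError("alphas and x must have the same length")
    # Linear predictor: intercept + controls + tie terms + ego random intercept.
    eta = (
        beta0
        + sum(a * v for a, v in zip(alphas, x))
        + beta1 * weak_tie
        + beta2 * nes_w
        + beta3 * nes_s
        + theta_i
    )
    # Inverse logit.
    return 1.0 / (1.0 + math.exp(-eta))


# Illustrative values only (not estimates from the data).
p_null = predict_prob(0.0, [], [], 0.0, 0, 0.0, 0.0, 0.0, 0.0, 0.0)
p_weak = predict_prob(-1.5, [0.8], [1], 0.3, 1, 0.5, 0.2, -0.4, 0.1, 0.0)
p_strong = predict_prob(-1.5, [0.8], [1], 0.3, 0, 0.5, 0.2, -0.4, 0.1, 0.0)
```

With a positive illustrative coefficient on WeakTie, switching the indicator from 0 to 1 raises the predicted probability while all other terms are held fixed.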
Model
• List of control variables:
  • (1) Did ego i feel very beneficial the last time they contacted alter j?
  • (2) Was this contact initiated by ego i? (3) Was this contact face-to-face? (4) Did the contact last longer than 1 hr?
  • (5) Homophily in sex and (6) homophily in age.
  • What is ego i’s relationship with alter j? (11 types)
  • How long has ego i known alter j? (5 levels)
  • How frequently does ego i contact alter j? (5 levels)
  • How likely is ego i to discuss important issues with alter j? (5 levels)
Model
• Model estimation strategies:
  • Scenario I: Model estimation using the truncated sample: egos whose cumulative number of contacted alters over a certain period is too small (the lowest 10%) are dropped from the sample.
  • Scenario II: Model estimation using a subsample (of the truncated sample): contact records with spouses, parents, children, and boyfriends/girlfriends are excluded from the data set.
    • Contacting spouses, parents, children, and boyfriends/girlfriends usually generates non-instrumental gains. Removing these contacts allows us to detect the weak tie effect and the embeddedness effect on instrumental gains.
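The two scenarios amount to two filtering steps, sketched below on toy data. The exact cutoff rule for the lowest 10% and the relation labels are not given in the slides, so both are assumptions here, and this sketch is not expected to reproduce the reported sample sizes.

```python
# Hypothetical relation labels; the actual ClickDiary coding is not shown in the slides.
EXCLUDED_RELATIONS = {"spouse", "parent", "child", "boyfriend/girlfriend"}


def truncate_egos(alter_counts, drop_percent=10):
    """Scenario I: drop the egos with the fewest cumulative contacted alters.

    alter_counts maps ego id -> cumulative number of contacted alters.
    Returns the set of ego ids that are kept.
    """
    # Sort by count, breaking ties by ego id so the result is deterministic.
    ranked = sorted(alter_counts, key=lambda ego: (alter_counts[ego], ego))
    n_drop = len(ranked) * drop_percent // 100
    return set(ranked[n_drop:])


def subsample(records, kept_egos):
    """Scenario II: keep retained egos' records, excluding close family and partners."""
    return [
        r for r in records
        if r["ego"] in kept_egos and r["relation"] not in EXCLUDED_RELATIONS
    ]


# Toy data: ten egos with 1..10 contacted alters.
alter_counts = {f"ego{i}": i + 1 for i in range(10)}
kept = truncate_egos(alter_counts)
records = [
    {"ego": "ego0", "relation": "friend"},
    {"ego": "ego5", "relation": "spouse"},
    {"ego": "ego5", "relation": "coworker"},
]
scenario_two = subsample(records, kept)
```

Here ego0 is dropped in Scenario I, so its record disappears from both samples, and the spouse record is removed only in Scenario II.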
Model Estimation
Figure: Scatter plot of the cumulative number of contacted alters vs the
number of contact days. Number of egos M = 130.
Model Estimation
                            Truncated sample   Subsample∗
# of egos (M)               115                115
# of alters (K)             13,091             12,563
# of contact records (N)    105,775            91,376
Table: Basic statistics for the two scenarios. ∗ denotes the sample excluding contacts with members of the immediate family and partners.
Model Estimation
             Truncated sample   Subsample∗
Yijl = 0     86,361 (81.6%)     75,204 (82.3%)
Yijl = 1     19,414 (18.4%)     16,172 (17.7%)
Total (N)    105,775 (100%)     91,376 (100%)
Table: Basic statistics of the dependent variable for the two scenarios. ∗ denotes the sample excluding contacts with members of the immediate family and partners.
Regression Results
Figure: The probabilities of feeling beneficial after contacting the alter. The results are based on the logistic regression model estimated from the ClickDiary data (Number of egos M = 115; Number of alters K = 13,091; Number of contact records N = 105,775).
Regression Results
Figure: The probabilities of feeling beneficial after contacting the alter. The results are based on the logistic regression model estimated from the ClickDiary data (Number of egos M = 115; Number of alters K = 12,563; Number of contact records N = 91,376).
Discussion
• Possible sources of bias:
  • NES scores only count interpersonal relations within the ego’s personal network. They do not consider interpersonal relations outside the ego’s personal network.
  • Frequently contacted alters count most. This may create a “self-reinforcement” loop: alters are contacted often because the ego feels beneficial after contacting them.
  • Enthusiastic respondents count most, yet they are a minority. This results in a highly unbalanced data structure.
• Modeling strategy: Ordinal regression.
• Estimation technique: Inverse probability weighted methods (Horvitz and Thompson, 1952; Robins et al., 1995; Tsiatis, 2006).
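The idea behind inverse probability weighting can be illustrated with the classical Horvitz-Thompson estimator, in which each observation is weighted by the reciprocal of its probability of being observed. The sketch below is generic and is not the estimator planned for the ClickDiary analysis; how the observation probabilities would be obtained for diary records is not specified in the slides, so the values used here are made up.

```python
def horvitz_thompson_total(values, inclusion_probs):
    """Horvitz-Thompson estimate of a population total."""
    if len(values) != len(inclusion_probs):
        raise ValueError("values and inclusion_probs must have the same length")
    if any(not (0.0 < p <= 1.0) for p in inclusion_probs):
        raise ValueError("inclusion probabilities must lie in (0, 1]")
    # Each observation stands in for 1/p units of the population.
    return sum(y / p for y, p in zip(values, inclusion_probs))


def ipw_mean(values, inclusion_probs):
    """Normalized inverse-probability-weighted mean."""
    total = horvitz_thompson_total(values, inclusion_probs)
    weight_sum = horvitz_thompson_total([1.0] * len(values), inclusion_probs)
    return total / weight_sum


# Toy example: three records with made-up observation probabilities.
y = [1, 0, 1]
probs = [0.5, 0.25, 1.0]
total = horvitz_thompson_total(y, probs)
mean = ipw_mean(y, probs)
```

Records that are unlikely to be observed receive larger weights, which is one way to keep the most enthusiastic respondents from dominating the estimates.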
ClickDiary App
Figure: Available from January 2015.