resolves - VideoLectures

Diversity of User Activity and Content Quality in Online Communities
Tad Hogg and Gabor Szabo
HP Labs
thanks to: C. Chan and J. Kittiyachavalit (Essembly)
M. Brzozowski and D. Wilkinson (HP)
online communities
essembly
Bugzilla
delicious
“wisdom of crowds”
Why model online communities?
• predict
• e.g., which new content will become popular?
• design web sites
• e.g., what to show users?
• encourage high-quality contributions
• e.g., what incentives?
heterogeneity is pervasive
• most activity from a few ‘top users’
• most interest in small fraction of content
• broad, long-tail distributions
typical << average << maximum
[figure: number of cases vs. activity]
topics
• case study: Essembly
• user activity
• content ratings
What is Essembly?
• political discussion web site
– help people identify others with similar views
– self-organize for political activity
Essembly: resolves
• users create resolves
– e.g., “free trade is good for American workers”
• other users vote & comment
– 4-point scale
• agree, lean agree, lean against, against
Why study Essembly?
• voting history since start of site
• modest-sized community
– can examine all users and content
• useful to study diversity
• distinct link semantics
– friend, ally, nemesis
• similar diversity as other communities
– Digg, Wikipedia, …
data set
Aug. 2005 to Dec. 2006
• 15,424 users
• 24,953 resolves
• 1.3 million votes
• networks
• comments
[figures: users active each month; new resolves each month (50 new resolves per day); number of votes each month]
data limitations
• anonymous
– no user characteristics
• e.g., demographics, political party, …
– no content of resolves or comments
• e.g., political topic area
– environment, economics, foreign aid, …
• hence:
– can’t test if characteristics explain diversity
user privacy vs. research usefulness
• no info on
– which resolves users view (but don’t vote on)
– how users find resolves (e.g., via networks)
topics
• case study: Essembly
• user activity
• content ratings
user activity
4741 active users with at least one action
actions: create a resolve, vote on a resolve, form a link
user model
• how long user is active
• how often user contributes while active
inactive: no activity for at least 30 days (conventional, but somewhat arbitrary, definition)
[diagram: active user (create, vote, link) → inactive user]
correlation between activity time and rate: -0.07
model as independent components of user behavior
caveat: users active only a short time have larger (negative) correlation: -0.2
user model
this model: consider whether user votes on resolve, not how user voted (agree, …, disagree) or comments
[diagram: active user (create, vote, link) → inactive user]
note: how users vote correlates with link type (friend, ally, nemesis)
M. Brzozowski et al., "Friends and Foes: Ideological Social Networking", Proc of CHI 2008
user activity
model components
• activity time
• activity rate
activity time distribution:
stretched exponential
for users active at least 1 day
→ diverse time scales for user participation
→ users active a long time less likely to quit in next day than new users
applies to many online communities [Wilkinson 2008]
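A minimal sketch of why a stretched exponential implies that long-time users are less likely to quit than new users. It assumes the survival form S(t) = exp(-(t/tau)^beta) with 0 < beta < 1; the names `survival` and `quit_probability_next_day` and the values of `TAU` and `BETA` are illustrative assumptions, not the fit to the Essembly data.

```python
import math

def survival(t, tau, beta):
    """Fraction of users still active after t days: exp(-(t/tau)**beta)."""
    return math.exp(-((t / tau) ** beta))

def quit_probability_next_day(t, tau, beta):
    """Probability that a user who has been active for t days quits within the next day."""
    return 1.0 - survival(t + 1, tau, beta) / survival(t, tau, beta)

# Illustrative (assumed) parameters; beta < 1 gives the "stretched" shape.
TAU, BETA = 20.0, 0.5

for days in (1, 10, 100):
    p = quit_probability_next_day(days, TAU, BETA)
    print(f"active {days:3d} days -> quits next day with probability {p:.3f}")
```

With beta = 1 the same functions give an ordinary exponential, for which the quit probability would not depend on how long the user has been active.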
user activity
model components
• activity time
• activity rate
activity rate distribution: lognormal
actions: create a resolve, vote on a resolve, form a link
[figure: distribution of log(ρ), the natural logarithm of actions per day, with a normal distribution fit to the log(ρ) values; axis annotations: 2 months/action, 60 actions/day]
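A lognormal fit amounts to a normal fit to the logarithms of the rates. The sketch below illustrates this on synthetic data; the parameters `MU` and `SIGMA` and the sample itself are assumptions for illustration, since the Essembly rates are not reproduced here.

```python
import math
import random
import statistics

random.seed(0)

# Synthetic stand-in for per-user activity rates (actions per day).
# MU and SIGMA are assumed values, not estimates from the Essembly data.
MU, SIGMA = -1.0, 1.5
rates = [random.lognormvariate(MU, SIGMA) for _ in range(5000)]

# A lognormal fit is a normal fit to the logarithms of the values.
log_rates = [math.log(rho) for rho in rates]
mu_hat = statistics.mean(log_rates)
sigma_hat = statistics.stdev(log_rates)

print(f"fitted mu = {mu_hat:.2f}, sigma = {sigma_hat:.2f}")
print(f"median rate = {math.exp(mu_hat):.2f} actions/day")
```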
user activity
• activity time
• activity rate
• combined model
user activity distribution
• product: (activity time) x (activity rate)
mismatch for small number of actions
negative correlation of time and rate for less active users
e.g., a few actions to “try out” the site over a day or so
4741 active users with at least one action
model captures diversity of action counts, but not bursts of activity (“sessions” of ~3 hours with longer breaks)
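The combined model can be sketched as a product of two independent draws: an activity time from a stretched exponential and an activity rate from a lognormal. All parameter values below are assumed for illustration, and the sketch returns real-valued counts rather than whole numbers of actions.

```python
import random
import statistics

random.seed(1)

# Assumed illustrative parameters (not fitted to the Essembly data).
TAU, BETA = 30.0, 0.5      # stretched-exponential scale (days) and exponent
MU, SIGMA = -1.0, 1.5      # lognormal parameters for log(actions per day)

def sample_action_count():
    # random.weibullvariate(scale, shape) has survival exp(-(t/scale)**shape),
    # which is a stretched exponential when shape < 1.
    activity_time = random.weibullvariate(TAU, BETA)
    activity_rate = random.lognormvariate(MU, SIGMA)
    # Independent components: total actions = (activity time) x (activity rate).
    return activity_time * activity_rate

counts = [sample_action_count() for _ in range(5000)]
typical = statistics.median(counts)
average = statistics.mean(counts)
maximum = max(counts)
print(f"typical = {typical:.1f}, average = {average:.1f}, maximum = {maximum:.1f}")
```

Because both components are broad, the product reproduces the ordering typical << average << maximum, but it has no notion of sessions or bursts.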
What determines user activity?
• diversity from two underlying broad distributions:
– activity time (stretched exponential)
• multiple time scales for losing interest in site
– activity rate (lognormal)
• multiplicative process leading to activity rate heterogeneity
• open question:
– What user characteristics and community properties
produce these distributions?
activity time: prior interest or experience?
“nature”
• users initially heterogeneous
• cohort increasingly dominated by high-utility users, who are less likely to quit
“nurture”
• users initially homogeneous
• change due to experience on site
• cohort increasingly dominated by users with good experience, who are less likely to quit
[figures, one for each case: utility vs. time user is active]
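The “nature” explanation can be illustrated with a small calculation: if each user has a fixed daily quit probability but the probabilities differ across users, the quit rate of the surviving cohort falls over time, because the users most likely to quit leave first. The three probabilities below are hypothetical, and the assumption of a constant per-user quit probability is a simplification for this sketch.

```python
# Assumed cohort: equal numbers of users with each fixed daily quit probability.
QUIT_PROBABILITIES = [0.5, 0.1, 0.02]

def cohort_quit_rate(day):
    """Fraction of the users still active on `day` who quit during that day."""
    # Share of each group that is still active at the start of `day`.
    remaining = [(1 - q) ** day for q in QUIT_PROBABILITIES]
    # Share of each group that quits during `day`.
    quitting = [r * q for r, q in zip(remaining, QUIT_PROBABILITIES)]
    return sum(quitting) / sum(remaining)

for day in (0, 10, 100):
    print(f"day {day:3d}: cohort quit rate = {cohort_quit_rate(day):.3f}")
```

No individual user changes in this sketch; only the composition of the cohort does. A “nurture” version would instead let each user's quit probability change with experience on the site.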
How to encourage participation?
• “nature”
– attract users whose interests fit the community
– expose potential users to site, word of mouth, …
• “nurture”
– improve rewards of use to keep people engaged
– “top contributor” status, niche subgroups, …
topics
• case study: Essembly
• user activity
• content ratings
votes on resolves
24953 resolves
similar broad distribution in other online communities
Digg, Wikipedia,… [Wilkinson 2008]
vote model
• visibility
– how easily users find a resolve
• interestingness
– probability users who see a resolve vote on it
[flowchart: user comes to Essembly → see the resolve? → yes → vote on the resolve?]
similar model for Digg [Lerman 2007]
content ratings
model components
• visibility
• interest
visibility:
how users find content
• browse
– e.g., recent or popular
• in general and within online network
• word of mouth
– from people aware of, and liking, the content
• e.g., link on a blog
• search
visibility distribution: power-law
• recency is key factor for visibility in Essembly
• contrast with controversy (standard dev. of votes):
not correlated with number of votes
[figure: visibility vs. age (number of subsequently introduced resolves)]
• large drop in visibility from user interface
• fewer votes to older resolves: approximately a power law
– “law of surfing” [Huberman et al. 1998]
content ratings
model components
• visibility
• interest
interestingness:
how much users like what they see
• persistent property of resolves
– resolves consistently get few or many votes
compared to average at similar age
• may have time dependence
– novelty decay [Wu & Huberman 2007]
• e.g., current news stories (Digg) vs. ideological discussions (e.g., free trade)
model parameter estimation
• model:
– visibility based on recency
– next vote goes to resolve x with relative probability r_x f(a_x)
• r_x is the resolve’s interestingness
• a_x is the resolve’s age
– number of subsequently introduced resolves
• simultaneously estimate
– ‘aging’ visibility function f(a)
– interestingness for resolves: r_1, r_2, …
– arbitrary scale factor for f and r
• we take f(1)=1
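A minimal sketch of the vote model's forward direction: given interestingness values and an aging function, compute the probability that the next vote goes to each resolve. The power-law form of `visibility`, its exponent, the convention that ages are counted from 1 for the newest resolve, and the interestingness values are all assumptions for illustration; the slides estimate f(a) and the r values from the data rather than fixing them.

```python
import random

def visibility(age, exponent=0.5):
    """Assumed aging function f(a) = a**(-exponent), so that f(1) = 1."""
    return age ** (-exponent)

def vote_probabilities(interestingness):
    """Probability that the next vote goes to each resolve.

    `interestingness` lists r for each resolve, oldest first. As an
    assumption of this sketch, ages are counted from 1 for the newest resolve.
    """
    n = len(interestingness)
    # Relative probability r * f(a) for each resolve, then normalize.
    weights = [r * visibility(n - i) for i, r in enumerate(interestingness)]
    total = sum(weights)
    return [w / total for w in weights]

random.seed(2)
r_values = [0.65, 0.01, 0.2, 0.2]   # hypothetical interestingness values
probs = vote_probabilities(r_values)
next_vote = random.choices(range(len(r_values)), weights=probs, k=1)[0]
print([round(p, 3) for p in probs], "next vote goes to resolve", next_vote)
```

Multiplying every r by a constant leaves the probabilities unchanged, which is why the model needs a convention such as f(1)=1 to fix the scale.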
interestingness distribution: lognormal
[figure: distribution of log(r), with a normal distribution fit to the log(r) values]
growth in number of votes for high and low interestingness
[figure: number of votes (log scale) vs. age (number of subsequently introduced resolves), two examples: r=0.65 and r=0.01]
content ratings
• visibility
• interest
• combined model
vote distribution
• sample at different ages from a multiplicative process: double Pareto lognormal distribution [Reed & Jorgensen 2004]
– lognormal center
– power law tails
[figure: distribution of votes, 24953 resolves]
What determines content value?
• lognormal → multiplication of factors
• possible mechanisms
– “rich get richer”
– “inherited wealth”
– or a mix of both
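The link between lognormal distributions and multiplication of factors can be sketched numerically: a product of many independent random factors has a logarithm that is a sum of many terms, so the logarithm is roughly normal. The factor distribution (uniform on [0.5, 1.5]) and the number of factors are arbitrary assumptions for this illustration.

```python
import math
import random
import statistics

random.seed(3)

def product_of_factors(n_factors=50):
    """Multiply independent random factors (uniform on [0.5, 1.5], an assumption)."""
    value = 1.0
    for _ in range(n_factors):
        value *= random.uniform(0.5, 1.5)
    return value

values = [product_of_factors() for _ in range(2000)]
logs = [math.log(v) for v in values]

# The values themselves are strongly right-skewed (mean above median) ...
print(f"values: median = {statistics.median(values):.3f}, mean = {statistics.mean(values):.3f}")
# ... while their logarithms are roughly symmetric, as for a normal distribution.
print(f"logs:   median = {statistics.median(logs):.3f}, mean = {statistics.mean(logs):.3f}")
```

The sketch does not say where the factors come from; “rich get richer” and “inherited wealth” are two different ways such multiplication could arise.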
model: visibility and interest lead to votes
votes increase visibility (“popular resolves”)
[flowchart: user comes to Essembly → see the resolve? (visibility) → yes → vote on the resolve? (interest) → votes; votes feed back into visibility]
votes → more votes
“rich get richer”
[diagram: votes, visibility]
• new votes
– proportional to number of prior votes
• with some variation
• influenced by observed popularity
– among all users or just friends
– examples
• costly to evaluate content personally
• ‘fashion’, latest ‘cool’ product
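A minimal sketch of a “rich get richer” process, in which each new vote goes to a resolve with probability proportional to its prior votes. The function name, the fixed pool of resolves that all exist from the start, and the “+ 1” offset are assumptions of this sketch; it omits aging and interestingness and is not the model fitted in the slides.

```python
import random

random.seed(4)

def simulate_rich_get_richer(n_resolves=200, n_votes=5000):
    """Assign votes one at a time, favoring resolves that already have votes."""
    votes = [0] * n_resolves
    for _ in range(n_votes):
        # Weight = prior votes + 1; the "+ 1" is an assumption of this sketch
        # so that resolves with no votes can still receive their first one.
        weights = [v + 1 for v in votes]
        winner = random.choices(range(n_resolves), weights=weights, k=1)[0]
        votes[winner] += 1
    return votes

votes = simulate_rich_get_richer()
ranked = sorted(votes, reverse=True)
top_share = sum(ranked[: len(ranked) // 10]) / sum(ranked)
print(f"top 10% of resolves hold {top_share:.0%} of the votes")
```

All resolves are identical at the start, so any inequality in the final counts comes only from the feedback of votes onto later votes.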
“inherited wealth”
[diagram: interest (match user interests), votes, visibility]
• new votes
– from matching users’ prior interests
• with some variation
– e.g. popular vs. niche political topics
– why a broad distribution?
• possibly: information cascade & confirmation bias
• M. Shermer “The Political Brain” Scientific Amer. July 2006
• S. Bikhchandani et al., “A Theory of Fads …” J. Political
Economy 100:992 (1992)
topics
• case study: Essembly
• user activity
• content ratings
• additional behaviors
predictions from early behavior
• model can identify
– new users likely to be very active
– new resolves likely to have high interest
• by factoring
– web site properties (visibility)
– user properties (interest in content)
• also with other sites: Digg, YouTube
– e.g., [Crane & Sornette 2008; Lerman & Galstyan 2008; Szabo & Huberman 2008]
number of links per user
• model: links due to common votes
– as intended to link ideologically similar users
• caveat: linked users also share visibility → votes
[figure: degree distribution]
Hogg & Szabo, in Europhysics Letters (to appear)
Do active users create interesting resolves?
[figures: r vs. user activity rate (actions/day); r vs. user activity time]
1827 active users who introduced at least one resolve
little correlation between a user’s activity and interestingness of resolves from that user
future work & summary
distinguishing mechanisms (future work)
• experiments
– alter information shown to random groups of users
• can change both visibility and popularity measures
• e.g., music downloads [Salganik et al. 2006]
– correlation → causal factors
• do votes depend on how users find content?
– e.g., influence of friends
• relate to characteristics of content and users
summary
• heterogeneous behavior
– user activity
– interest in content
• model via components of behavior
– steps toward identifying mechanisms
• example: political discussion Essembly
– user activity: time on site & activity rate
– votes: visibility & interestingness
• experiments to distinguish mechanisms