Links among personal homepages at MIT and Stanford

You Are What You Link
Lada Adamic
Eytan Adar
WWW 10 – May, 2001
Outline
Graph structures of social networks
• How person-to-person links on the web create
observable social networks
Understanding and predicting links
•Additional online info (text, links, email subscriptions)
gives context to social links
•Predict social links even where there is no explicit
hyperlink.
Understanding communities through links
Julie
Hi, I’m Julie!
I’m studying...
I like ...
My friends are...
My favorite links:
Becky
Hey, I’m Becky.
I study...
I live in ...
My favorite books
are...
Here are some
photos...
Becky and Julie aren’t the only
ones to link to each other
Stanford Social Web
Graph Structure of
Social Networks
Differences in cohesiveness of communities
[Figure panels: Stanford, MIT]
Links among personal homepages at MIT and Stanford
                                           MIT     Stanford
Users with non-empty WWW directories      2302     7473
Percent with links in either direction     69%      29%
Percent with links in both directions      22%       7%
The number of links/person is uneven
[Figure: number of users vs. number of links to or from users, powers-of-ten axis ticks; series: given, received, undirected]
Interesting social networks analysis
Largest connected component
MIT: 86%
Stanford: 58%
Shortest path from one person to another
MIT: 6.4 hops
Stanford: 9.2 hops
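The two measures above can be sketched on a toy graph. This is a minimal illustration, not the authors' code: it assumes links are treated as undirected, that the path figure is an average over pairs inside the largest component, and the adjacency dictionary `adj` and all node names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return hop counts from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def largest_component(adj):
    """Return the node set of the largest connected component."""
    seen, best = set(), set()
    for node in adj:
        if node in seen:
            continue
        comp = set(bfs_distances(adj, node))
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best


def average_shortest_path(adj, nodes):
    """Average hop count over all ordered pairs of distinct nodes in `nodes`."""
    total = pairs = 0
    for node in nodes:
        for other, d in bfs_distances(adj, node).items():
            if other != node and other in nodes:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


# Hypothetical toy graph: a chain a-b-c-d, a separate pair e-f, and a loner g.
adj = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"},
    "e": {"f"}, "f": {"e"}, "g": set(),
}
component = largest_component(adj)
fraction = len(component) / len(adj)   # share of users in the largest component
avg_hops = average_shortest_path(adj, component)
```

Restricting the average to the largest component keeps unreachable pairs from making the hop count undefined.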
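Clustering Coefficient
$$C = \frac{\text{\# of links among neighbors}}{\text{max \# of links among neighbors}}$$
Example (4 neighbors with 3 links among them):
$$C = \frac{3}{4 \cdot 3 / 2} = \frac{1}{2}$$
MIT: 0.22
Stanford: 0.21
70x that of a random graph!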
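A small sketch of the clustering coefficient for one user: the number of links among that user's neighbors divided by the maximum possible, k(k-1)/2 for k neighbors. The graph below is hypothetical and undirected. Averaging over users, and scoring users with fewer than two neighbors as 0, are assumptions rather than something the slides specify.

```python
from itertools import combinations


def local_clustering(adj, node):
    """Links among a node's neighbors divided by the maximum possible."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumption: undefined cases count as 0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return links / (k * (k - 1) / 2)


def average_clustering(adj):
    """Mean of the local coefficients over all nodes (an assumed convention)."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


# Hypothetical user x with 4 neighbors and 3 links among them (p-q, q-r, r-s).
adj = {
    "x": {"p", "q", "r", "s"},
    "p": {"x", "q"},
    "q": {"x", "p", "r"},
    "r": {"x", "q", "s"},
    "s": {"x", "r"},
}
c_x = local_clustering(adj, "x")  # 3 / (4*3/2) = 0.5
```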
Understanding and Predicting Links
Information available online
[Figure labels: email list, common text, outlink]
How information was collected
Users’ web directories were crawled
Outlinks were extracted
Text was passed through ThingFinder to
extract things like
people, places, companies
Mailing list subscriptions were obtained from the mailing
list servers (95% public for Stanford, internal to MIT)
Inlinks were obtained by querying search engines:
Google for Stanford
AltaVista for MIT (equivalent URLs)
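The outlink extraction step could look roughly like the sketch below. It uses only Python's standard library and is not the crawler the authors used. The class name `OutlinkParser`, the helper `extract_outlinks`, and the `~julie` and `~becky` page URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OutlinkParser(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.outlinks.add(urljoin(self.base_url, value))


def extract_outlinks(html, base_url):
    parser = OutlinkParser(base_url)
    parser.feed(html)
    return parser.outlinks


# Hypothetical homepage fragment.
page = (
    '<p>My friends: <a href="/~becky/">Becky</a>. '
    'My school: <a href="http://www.ntua.gr">NTUA</a></p>'
)
links = extract_outlinks(page, "http://www.stanford.edu/~julie/")
```

Resolving relative links matters here because a link to another user's homepage on the same server is exactly the kind of person-to-person link the study counts.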
Comparison with traditional means of
gathering information on social networks
Advantages
Easily and automatically gathered (no phone, live,
or mail surveys).
Data sets are orders of magnitude larger.
Information is already public.
Disadvantages
Data sets are incomplete
i.e. you don’t get to ask the questions, just
take down the answers
Friends have more in common
I love Prince!
Prince is the
coolest!
I play basketball
I live in Terra
House
Find me in Terra.
I live in Kimball
I play volleyball
Wanna play
volleyball?
I play a lot of computer
games
user 1: kpsounis (Konstantinos Psounis)
user 2: stoumpis (Stavros Toumpis)
Things in common
CITIES: Escondido, Cambridge, Athens
NOUN GROUPS: birth date, undergraduate studies, student association
MISC: general lyceum, NTUA, Ph.D., electrical engineering, computer science, TOEFL, computer
COUNTRIES: Greece
Out links in common
http://www.stanford.edu/group/hellas   Hellenic association
http://www.kathimerini.gr              Athens news
http://ee.stanford.edu                 Electrical Engineering Department
http://www.ntua.gr                     National Technical University of Athens
In links in common
http://www.stanford.edu/~dkarali       Dora Karali's homepage
http://171.64.54.173/filarakia.html    Dimitrios Vamvatsikos friends list
Mailing lists in common
greek-sports   Soccer/Basketball mailing lists for members of Hellas
hellenic       Hellenic association members
ee261-list     Fourier transform class list
ee376b         Information theory class list
http://negotiation.parc.xerox.com/web10/
So can we guess who’s friends with whom from
the information gathered online?
• Choose person A
• Rank everybody else according to their likeness to that
person
• See how “friends” (people who are linked to A) were ranked.
• Evaluate for text, outlinks, inlinks, mailing lists separately
$$\mathrm{likeness}(A, B) = \sum_{\text{shared items}} \frac{1}{\log[\mathrm{frequency}(\text{shared item})]}$$
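A minimal sketch of the ranking procedure, reading the likeness score as a sum over the items two people share, each weighted by 1/log(frequency of that item). Two things are assumptions: frequency is taken to be the number of users who have the item, and the log is natural. The user IDs and item names are hypothetical toy data.

```python
import math
from collections import Counter


def item_frequencies(items_by_user):
    """Count how many users have each item (assumed meaning of 'frequency')."""
    freq = Counter()
    for items in items_by_user.values():
        freq.update(set(items))
    return freq


def likeness(a, b, items_by_user, freq):
    """Sum of 1/log(frequency) over the items shared by users a and b."""
    shared = set(items_by_user[a]) & set(items_by_user[b])
    # A shared item belongs to at least two users, so log(freq) > 0.
    return sum(1.0 / math.log(freq[item]) for item in shared)


def rank_by_likeness(a, items_by_user, freq):
    """Everyone with a nonzero likeness to a, most similar first."""
    scores = {
        b: likeness(a, b, items_by_user, freq)
        for b in items_by_user if b != a
    }
    return sorted((b for b in scores if scores[b] > 0),
                  key=lambda b: scores[b], reverse=True)


# Hypothetical data: "hellenic" is rare (2 users), "stanford" is held by all 4.
items_by_user = {
    "A": {"hellenic", "stanford"},
    "B": {"hellenic", "stanford"},
    "C": {"stanford"},
    "D": {"stanford"},
}
freq = item_frequencies(items_by_user)
ranking = rank_by_likeness("A", items_by_user, freq)
```

The weighting means a rare shared item such as a small mailing list contributes far more than one nearly everyone has, which is why B outranks C and D for user A.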
Example, top matches for a particular user
annaken: Clifford Hsiang Chao
Linked (friends)   Likeness score   Person
NO                 8.25             Eric Liao
YES                3.96             John Vestal
NO                 3.27             Desiree Ong
YES                2.82             Stanley Lin
NO                 2.66             Daniel Chai
NO                 2.55             Wei Hsu
YES                2.42             David Lee
NO                 2.41             Byung Lee
Coverage in ability to predict user-user links
i.e. friends had at least one item in common
Method          Pairs ranked (Stanford)   Pairs ranked (MIT)
inlinks         24%                       17%
outlinks        35%                       53%
mailing lists   53%                       41%
text            53%                       64%
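Coverage, as described here, is the share of linked pairs that have at least one item in common under a given method. The sketch below assumes pairs are given as a list of user-ID tuples and items as sets per user. All names and data are hypothetical.

```python
def coverage(friend_pairs, items_by_user):
    """Fraction of linked pairs sharing at least one item."""
    if not friend_pairs:
        return 0.0
    covered = sum(
        1 for a, b in friend_pairs
        if items_by_user.get(a, set()) & items_by_user.get(b, set())
    )
    return covered / len(friend_pairs)


# Hypothetical linked pairs and, say, their mailing list subscriptions.
friend_pairs = [("u1", "u2"), ("u1", "u3"), ("u2", "u4"), ("u3", "u4")]
items_by_user = {
    "u1": {"x", "y"},
    "u2": {"y"},
    "u3": {"z"},
    "u4": {"z"},
}
share = coverage(friend_pairs, items_by_user)  # u1-u2 and u3-u4 overlap
```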
Performance of friend matching algorithm
The most common ranking for a friend is #1
[Figure: histograms of frequency vs. rank (1 to 10), one panel each for Stanford and MIT; series: in link, out link, mailing list, thing]
Average rank by method
method          Stanford   MIT
inlinks          6.0        9.3
outlinks        14.2       18.0
mailing lists   11.1       22.0
text            23.6       31.6
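One way to tabulate these results: for each user, record the 1-based positions at which actual friends appear in that user's likeness ranking, then build a histogram and an average. This is a sketch under assumptions. The slides do not say how ties or unranked friends were handled, and the rankings and friend sets below are hypothetical.

```python
from collections import Counter


def friend_ranks(ranked_candidates, friends):
    """1-based positions of actual friends in a ranked candidate list."""
    return [
        pos for pos, person in enumerate(ranked_candidates, start=1)
        if person in friends
    ]


def rank_summary(rankings, friends_of):
    """Histogram of friend ranks and their average across all users."""
    ranks = []
    for user, ranked in rankings.items():
        ranks.extend(friend_ranks(ranked, friends_of.get(user, set())))
    histogram = Counter(ranks)
    average = sum(ranks) / len(ranks) if ranks else None
    return histogram, average


# Hypothetical rankings (most similar first) and linked friends.
rankings = {"A": ["p", "q", "r", "s"], "B": ["q", "p"]}
friends_of = {"A": {"q", "s"}, "B": {"q"}}
histogram, average = rank_summary(rankings, friends_of)
```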
Stanford
we don’t have that much in common with our
friend’s friend’s friends
Understanding Communities Through Links
What are good and bad link predictors?
• What you would expect…
• Very unique things are only relevant to individuals
• Very general things (“MIT” “Stanford”) are relevant to
everyone
• Some top 10 lists…
Text Based Predictors
MIT Top Things
  Union Chicana (student group)
  Phi Beta Epsilon (fraternity)
  Bhangra (traditional dance, practiced within a club at MIT)
  neurosci (appears to be the journal Neuroscience)
  Phi Sigma Kappa (fraternity)
  PBE (fraternity)
  Chi Phi (fraternity)
  Alpha Chi Omega (sorority)
  Stuyvesant High School
  Russian House (living group)
Stanford Top Things
  NTUA (National Technical University of Athens)
  Project Aiyme (mentoring Asian American 8th graders)
  pearl tea (popular drink among members of a sorority)
  clarpic (section of marching band)
  KDPhi (Sorority)
  technology systems (computer networking services)
  UCAA (Undergraduate Asian American Association)
  infectious diseases (research interest)
  viruses (research interest)
  home church (Religious phrase)
• Bad phrases: general organizations, cities (Oakland,
Cambridge, etc), departments (CS)
Out-link Based Predictors
MIT Top Out-links
  MIT Campus Crusade for Christ*
  The Church of Latter Day Saints
  The Review of Particle Physics
  New House 4 (dorm floor, home page)*
  MIT Pagan Student Group*
  Web Communication Services*
  Tzalmir (role playing game)*
  Russian house (living group) comedy team*
  Sigma Chi (fraternity)*
  La Unión Chicana por Aztlán
Stanford Top Out-links
  alpha Kappa Delta Phi (Sorority)*
  National Technical University Athens
  Ackerly Lab (biology)*
  Hellenic Association*
  Iranian Cultural Association*
  Mendicants (a cappella group)*
  Phi_Kappa_Psi (fraternity)*
  Magnetic Resonance Systems Research Lab*
  Applications assistance group*
  ITSS instructional programs*
• Worst ranked sites are search engines and portals
(Altavista, Lycos, Yahoo, etc.), and top level homepages
such as www.mit.edu and www.stanford.edu.
In-link Based Predictors
• The top predictors are almost exclusively individual
home pages pointing to lists of friends
• Poor predictors: Long lists (all homepages, department
listings)
Mailing List Based Predictors
MIT Top Mailing Lists
  Summer social events for residents of specific dorm floor
  Religious group
  Religious group
  Religious group
  Intramural sports team from a specific dorm
  Summer social events for residents of specific dorm floor
  Religious a cappella group
  Intramural sports team from a specific dorm
  “…discussion of MIT life and administration.”
  Religious group
Stanford Top Mailing Lists
  Kairos97 (dorm)
  mendicant-members (a cappella group)
  Cedro96 (dorm summer mailing list)
  first-years (first year economics doctoral students)
  local-mendicant-alumni (local a cappella group alumni)
  john-15v13 (Fellowship of Christ class of 1999)
  stanford-hungarians (Hungarian students)
  serra95-96 (dorm)
  metricom-users (network services employees who use metricom)
  science-bus (science education program organized by engineering students)
• Bad lists: general announcement lists at MIT, non-housing-based activities (theater), job lists
Future Work
• Use other pieces of available information
• demographic information (where people live,
department, year, etc.)
• combine information
• Label structures (Flake et al. 2000)
• Given structures determined by graph algorithms
• Label them using extracted information
Summary
• Homepage graph structure varies depending on
community
• Possible to predict (to some degree) where links
will exist
• Good predictors seem unique to communities