
Sampling online communities: using triplets as basis
for a (semi-)automated hyperlink web crawler.
Yoann VENY
Université Libre de Bruxelles (ULB) - GERME
[email protected]
This research is funded by the FRS-FNRS
Paper presented at the 15th General Online Research Conference, 4-6 March, Mannheim
Online communities – a theoretical definition
• What is an online community?
• “social aggregations that emerge from the Net when enough
people carry on those public discussions long enough, with
sufficient human feeling, to form webs of personal
relationship in cyberspace” (Rheingold 2000)
• long term involvement (Jones 2006)
• sense of community (Blanchard 2008)
• temporal perspective (Lin et al 2006)
• Probably important … but the first step should be to
take the ‘hyperlink environment’ into account
→ a graph analysis / SNA issue
Online Communities – A graphical definition (1)
• Community = more ties among members than with non-members
• three general classes of ‘community’ in graph partitioning algorithms
(Fortunato 2010):
– a local definition: focus on sub-graphs (e.g. cliques, n-cliques (Luce,
1950), k-plexes (Seidman & Foster, 1978), lambda sets
(Borgatti et al, 1990), …)
– a global definition: focus on the graph as a whole (is the observed graph
significantly different from a random graph, e.g. an
Erdös-Rényi graph?)
– vertex similarity: focus on actors (e.g. Euclidean distance & hierarchical
clustering, max-flow/min-cut (Elias et al, 1956; Flake
et al, 2000))
Online communities – a graphical definition (2)
• Two main problems of graph partitioning in a hyperlink
environment:
• 1) network size and form (i.e. tree structure)
• 2) edge direction
• → better to discover communities with an efficient web crawler
Web crawling - Generalities
• The general idea of a web crawling process:
- We have a number of starting blogs (seeds)
- All hyperlinks are retrieved from these seed blogs
- For each new website discovered, decide whether the new site is accepted or refused
- If the site is accepted, it becomes a seed and the process is repeated on this site (a minimal sketch of this loop follows below)
Source: Jacomi & Ghitalla (2007)
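
A minimal sketch of this generic crawl loop, assuming a hypothetical fetch_links(site) helper that returns the outgoing hyperlinks of a page and an accept(site) decision function (both names are illustrative, not from the original):

from collections import deque

def crawl(seeds, fetch_links, accept):
    """Generic seed-expansion crawl: every accepted site becomes a new seed."""
    accepted = set(seeds)
    frontier = deque(seeds)                  # seeds waiting to be expanded
    while frontier:
        site = frontier.popleft()
        for target in fetch_links(site):     # retrieve all hyperlinks of the seed
            if target not in accepted and accept(target):
                accepted.add(target)         # accepted sites become seeds in turn
                frontier.append(target)
    return accepted

The manual version of this loop puts a human behind accept(); the constraint-based crawlers below replace that decision with a structural test.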
Web crawling – constraint-based web crawler (1)
• Two problems of a manual crawler:
• Number and quality of decisions
• Closure?
• A solution: take advantage of local structural properties of a
network:
→ Assume that a network is the outcome of the aggregation of local
social processes:
– Examples in SNA:
• General philosophy of ERG models (see e.g. Robins et al 2007)
• Local clustering coefficient (see e.g. Watts & Strogatz, 1998)
→ Constrain the crawler to identify local social structures (i.e. triangles,
mutual dyads, transitive triads, …)
Web crawling – constraint-based web crawler (2)
Generalisation
Let $G(V_G, E_G)$ be the graph of the whole hyperlink environment, where $V_G$ are the vertices and $E_G$ the edges of the graph.
Let $A(V_A, E_A) \subseteq G(V_G, E_G)$ be the graph of the community, where $V_A$ are the vertices and $E_A$ the edges of the graph.

For each $a \in V_A$ {
    For each element $b$ in the neighborhood of $a$, defined as $N(a) = \{ b \in V_G : ab \in E_G \}$ {
        Define a new subgraph of $G$:
            $P(V_P, E_P) = A(V_A, E_A) \cup N(b)$
        Calculate:
            $T_P$: vector of local network statistics in $P(V_P, E_P)$
            $T_A$: vector of local network statistics in $A(V_A, E_A)$
        If any($t \in T_P$ > corresponding $t \in T_A$) {
            add $b$ to $A(V_A, E_A)$
        }
    }
}
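
As a sketch, the generalisation above could be implemented as follows, assuming the two graphs are given as dicts of adjacency sets and that stats() returns the vector of local network statistics (all names are illustrative; reading "P = A ∪ N(b)" as A plus the candidate b and its ties into A is one interpretation):

def crawl_step(A, G, stats):
    """One pass of the constraint-based crawler (undirected sketch).

    A, G  : adjacency dicts {vertex: set(neighbours)}, A a subgraph of G.
    stats : callable returning a tuple of local network statistics.
    """
    for a in list(A):                             # a in V_A
        for b in list(G.get(a, ())):              # b in N(a)
            if b in A:
                continue
            # P = A plus b and b's ties into the community
            P = {v: set(n) for v, n in A.items()}
            P[b] = {c for c in G.get(b, ()) if c in A}
            for c in P[b]:
                P[c].add(b)
            # accept b if any local statistic increases in P relative to A
            if any(p > q for p, q in zip(stats(P), stats(A))):
                for v, n in P.items():
                    A.setdefault(v, set()).update(n)
    return A

Iterating crawl_step until no new vertex is accepted gives the closure reported in the results slides; plugging a triangle count into stats yields the triangle crawler introduced next.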
An example of a constrained web crawler based on the identification of triangles
Experimental results – method

$Y$ is the $n \times n$ adjacency matrix of a binary network with elements:

$Y_{ij} = \begin{cases} 1, & \text{if there is an edge between } i \text{ and } j \\ 0, & \text{otherwise} \end{cases}$

The general algorithm above is run with one of four local network statistics as the vector $T$:

Undirected dyadic (unsupervised crawler): $\#\text{edges} = \sum_{1 \le i < j \le n} y_{ij}$

Directed dyadic (mutuality crawler): $\#\text{reciprocal} = \sum_{1 \le i < j \le n} y_{ij}\, y_{ji}$

Undirected triadic (triangle crawler): $\#\text{triangles} = \sum_{1 \le i < j < h \le n} y_{ij}\, y_{jh}\, y_{hi}$

Directed triadic (triplet crawler): $\#\text{triplets} = \sum_{i < j} (y_{ij} + y_{ji})\, L2_{ij}$

where $L2_{ij}$ is the number of “two-paths” connecting $i$ and $j$ or $j$ and $i$.
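
The four statistics can be computed directly from the adjacency matrix. A minimal sketch, assuming $Y$ is a list of lists of 0/1 ints (function names are illustrative, and the two-path count reflects one reading of the slide's $L2_{ij}$ definition):

def count_edges(Y):
    """#edges: sum over i<j of y_ij (undirected dyadic / unsupervised crawler)."""
    n = len(Y)
    return sum(Y[i][j] for i in range(n) for j in range(i + 1, n))

def count_reciprocal(Y):
    """#reciprocal: sum over i<j of y_ij * y_ji (directed dyadic / mutuality crawler)."""
    n = len(Y)
    return sum(Y[i][j] * Y[j][i] for i in range(n) for j in range(i + 1, n))

def count_triangles(Y):
    """#triangles: sum over i<j<h of y_ij * y_jh * y_hi (undirected triadic / triangle crawler)."""
    n = len(Y)
    return sum(Y[i][j] * Y[j][h] * Y[h][i]
               for i in range(n)
               for j in range(i + 1, n)
               for h in range(j + 1, n))

def count_triplets(Y):
    """#triplets: sum over i<j of (y_ij + y_ji) * L2_ij (directed triadic / triplet crawler)."""
    n = len(Y)

    def l2(i, j):
        # two-paths i -> k -> j or j -> k -> i through any third vertex k
        return sum(Y[i][k] * Y[k][j] + Y[j][k] * Y[k][i]
                   for k in range(n) if k != i and k != j)

    return sum((Y[i][j] + Y[j][i]) * l2(i, j)
               for i in range(n) for j in range(i + 1, n))

Any of these can serve as the stats() argument of the crawl_step sketch above, e.g. stats=lambda A: (count_triangles(to_matrix(A)),) with a suitable dict-to-matrix conversion.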
Experimental results – results (1)

Starting set: 6 « political ecology » blogs

[Figure: number of sites accepted per iteration by the dyad, triangle, triplet and unsupervised crawlers, with log-scaled counts (logDyads, logTriangles, logTriplets, logUnsupervised).]

Remarks:
• dyad sampler and triplet sampler → closure
• unsupervised and triangle samplers → manually stopped
Experimental results – results (2)
[Figure: networks sampled by the triangle, dyad and triplet crawlers.]
• The unsupervised crawler is not manageable (more than 20,000 actors after
4 iterations!)
• Dyads: did not select ‘authoritative’ sources + sensitive to
the number of seeds?
• Triplets seem to be the best solution: they take tie direction into
account + exploit authoritative sources + stay conservative
• Triangles: problem of network size … but the sampled network
can have interesting properties.
Conclusion and further research
• Pitfalls to avoid:
• Not all relevant information is necessarily in the core: there is a lot of
information in the periphery of this core.
• Based on human behaviour patterns: not adapted at all to other
kinds of networks (word occurrences, protein chains, …)
• Do not throw away more classical graph partitioning methods
• Always question your results.
• How to assess the efficiency of a crawler? Should communities in a
web graph always be topic-centered?
• Further research:
• Analysis and detection of ‘multi-core’ networks
• ‘Random walks’ in complete networks to find recursive patterns
using T.C. assumptions
• Code of the samplers in ‘R’