Sampling online communities: using triplets as the basis for a (semi-)automated hyperlink web crawler

Yoann VENY
Université Libre de Bruxelles (ULB) - GERME
[email protected]
This research is funded by the FRS-FNRS.
Paper presented at the 15th General Online Research Conference, 4-6 March, Mannheim

Online communities – a theoretical definition
• What is an online community?
• "social aggregations that emerge from the Net when enough people carry on those public discussions long enough, with sufficient human feeling, to form webs of personal relationship in cyberspace" (Rheingold 2000)
• Long-term involvement (Jones 2006)
• Sense of community (Blanchard 2008)
• Temporal perspective (Lin et al. 2006)
• These dimensions probably matter, but the first operation should be to take the 'hyperlink environment' into account
• This is a graph analysis / SNA issue

Online communities – a graphical definition (1)
• Community = more ties among members than with non-members
• Three general classes of 'community' in graph partitioning algorithms (Fortunato 2010):
– A local definition: focus on sub-graphs (e.g. cliques, n-cliques (Luce 1950), k-plexes (Seidman & Foster 1978), lambda sets (Borgatti et al. 1990), …)
– A global definition: focus on the graph as a whole (is the observed graph significantly different from a random graph, e.g. an Erdős-Rényi graph?)
– Vertex similarity: focus on actors (e.g. Euclidean distance and hierarchical clustering, max-flow/min-cut (Elias et al. 1956; Flake et al. 2000))

Online communities – a graphical definition (2)
• Two main problems of graph partitioning in a hyperlink environment:
1) Network size and form (i.e. tree structure)
2) Edge direction
• Better to discover communities with an efficient web crawler

Web crawling – generalities
• The general idea of a web crawling process:
– We start with a number of starting blogs (seeds)
– All hyperlinks are retrieved from these seed blogs
– For each newly discovered website, decide whether this new site is accepted or refused
– If the site is accepted, it becomes a seed and the process is reiterated on this site
Source: Jacomy & Ghitalla (2007)

Web crawling – a constraint-based web crawler (1)
• Two problems of a manual crawler:
– The number and quality of decisions
– Closure?
• A solution: take advantage of the local structural properties of a network. Assume that a network is the outcome of the aggregation of local social processes:
– Examples in SNA:
• The general philosophy of ERG models (see e.g. Robins et al. 2007)
• The local clustering coefficient (see e.g. Watts & Strogatz 1998)
• Constrain the crawler to identify local social structures (i.e. triangles, mutual dyads, transitive triads, …)

Web crawling – a constraint-based web crawler (2): generalisation
Let G(V_G, E_G) be the general graph of the whole hyperlink environment, where V_G are the vertices and E_G the edges of the graph.
Let A(V_A, E_A) ⊆ G(V_G, E_G) be the graph of the community, where V_A are the vertices and E_A the edges of the graph.
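The generic crawl loop described in the generalities slide can be sketched as follows; `get_outlinks` and `accept` are hypothetical stand-ins for, respectively, the hyperlink-retrieval step and the manual (or automated) accept/refuse decision:

```python
from collections import deque

def crawl(seeds, get_outlinks, accept, max_sites=1000):
    """Generic snowball crawl: retrieve the hyperlinks of each seed,
    submit every newly discovered site to accept(), and re-crawl
    accepted sites until no new site is found (or a size cap is hit)."""
    accepted = set(seeds)
    frontier = deque(seeds)
    seen = set(seeds)
    while frontier and len(accepted) < max_sites:
        site = frontier.popleft()
        for target in get_outlinks(site):
            if target in seen:
                continue
            seen.add(target)
            if accept(target):        # the accept/refuse decision
                accepted.add(target)  # accepted sites become new seeds
                frontier.append(target)
    return accepted
```

With a manual `accept`, the two problems noted on the next slide (decision load and closure) follow directly: every discovered site triggers a human decision, and nothing guarantees the loop ever empties its frontier.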
For each a ∈ V_A {
    For each element b in the neighborhood of a, defined as N(a) = { b ∈ V_G : ab ∈ E_G } {
        Define a new subgraph of G: P(V_P, E_P) = A(V_A, E_A) ∪ N(b)
        Calculate:
            T_P : the vector of local network statistics in P(V_P, E_P)
            T_A : the vector of local network statistics in A(V_A, E_A)
        If any(t ∈ T_P > t ∈ T_A) {
            Set b ∈ A(V_A, E_A)
        }
    }
}

An example of a constrained web crawler based on the identification of triangles.

Experimental results – method
Y is the n × n adjacency matrix of a binary network with elements:
Y_ij = 1 if there is an edge between i and j, and 0 otherwise.
The algorithm above is run with one of four local statistics, each defining a crawler:
• Undirected dyadic: #edges = Σ_{1≤i<j≤n} y_ij (unsupervised crawler)
• Directed dyadic: #reciprocal = Σ_{1≤i<j≤n} y_ij·y_ji (mutuality crawler)
• Undirected triadic: #triangles = Σ_{1≤i<j<h≤n} y_ij·y_jh·y_hi (triangle crawler)
• Directed triadic: #triplets = Σ_{i<j} (y_ij + y_ji)·L2_ij (triplet crawler)
where L2_ij is the number of "two-paths" connecting i and j or j and i.
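A minimal Python sketch of the acceptance step, using the triangle count as the single local statistic (the deck's own samplers are in R and not shown here). For simplicity the candidate subgraph P is taken as A plus the candidate node b, and the dict-of-sets graph representation and function names are mine:

```python
def count_triangles(nodes, adj):
    """Number of triangles in the subgraph induced by `nodes`,
    where `adj` maps each node to its set of undirected neighbours."""
    nodes = set(nodes)
    count = 0
    for i in nodes:
        for j in adj.get(i, set()) & nodes:
            for h in adj.get(j, set()) & nodes:
                if i in adj.get(h, set()):
                    count += 1
    return count // 6  # each triangle is visited once per ordered walk

def accept_candidate(community, candidate, adj):
    """Accept `candidate` into the community A when the candidate
    subgraph P = A + {candidate} raises the local statistic T."""
    t_a = count_triangles(community, adj)
    t_p = count_triangles(community | {candidate}, adj)
    return t_p > t_a
```

The same skeleton yields the other three crawlers by swapping `count_triangles` for an edge, mutual-dyad, or triplet count.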
Experimental results – results (1)
[Figure: number of sites sampled per iteration by each crawler, plotted as dyad, triangle, triplet, and unsupervised counts and their logs]
• Starting set: 6 "political ecology" blogs
• Remarks:
– The dyad and triplet samplers reached closure
– The unsupervised and triangle samplers were stopped manually

Experimental results – results (2)
[Diagram comparing the networks sampled by the triangle, dyad, and triplet crawlers]
• The unsupervised crawler is not manageable (more than 20,000 actors after 4 iterations!)
• Dyads: did not select 'authoritative' sources, and possibly sensitive to the number of seeds
• Triplets seem to be the best solution: they take tie direction into account, benefit from authoritative sources, and remain conservative
• Triangles: problem of network size… but the sampled network can have interesting properties

Conclusion and further research
• Pitfalls to avoid:
– Not all the relevant information is necessarily in the core: there is a lot of information in the periphery of this core
– The approach is based on human behaviour patterns: it is not adapted at all to other kinds of networks (word co-occurrences, protein chains, …)
– Do not throw away more classical graph partitioning methods
– Always question your results
• How can the efficiency of a crawler be assessed?
• Should communities in a web graph always be topic-centered?
• Further research:
– Analysis and detection of 'multi-core' networks
– 'Random walks' in complete networks to find recursive patterns using T.C. assumptions
– Code of the samplers in 'R'
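The samplers' R code is not reproduced here. As an illustration only, the four statistics from the method slide can be computed from a binary adjacency matrix in pure Python; the function name is mine, and L2_ij is read as the number of directed two-paths from i to j plus those from j to i, one plausible reading of the slide's definition:

```python
def crawler_statistics(y):
    """Compute the four crawler statistics from a binary adjacency
    matrix y (list of lists, y[i][j] == 1 iff there is a tie i -> j),
    following the slide formulas literally."""
    n = len(y)

    def two_paths(i, j):
        # directed two-paths i->h->j plus j->h->i (reading of L2_ij)
        return sum(y[i][h] * y[h][j] + y[j][h] * y[h][i] for h in range(n))

    edges = sum(y[i][j] for i in range(n) for j in range(i + 1, n))
    reciprocal = sum(y[i][j] * y[j][i]
                     for i in range(n) for j in range(i + 1, n))
    triangles = sum(y[i][j] * y[j][h] * y[h][i]
                    for i in range(n) for j in range(i + 1, n)
                    for h in range(j + 1, n))
    triplets = sum((y[i][j] + y[j][i]) * two_paths(i, j)
                   for i in range(n) for j in range(i + 1, n))
    return {"edges": edges, "reciprocal": reciprocal,
            "triangles": triangles, "triplets": triplets}
```

For example, on the 3-cycle 0→1→2→0 this yields 2 upper-triangle edges, no reciprocal dyads, 1 triangle, and 3 triplets.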