PTE: Enumerating Trillion Triangles On Distributed Systems

Ha-Myung Park (KAIST, [email protected]), Sung-Hyon Myaeng (KAIST, [email protected]), U Kang (Seoul National University, [email protected]; corresponding author)

ABSTRACT

How can we enumerate triangles from an enormous graph with billions of vertices and edges? Triangle enumeration is an important task for graph data analysis with many applications, including identifying suspicious users in social networks, detecting web spam, and finding communities. However, recent networks are so large that most previous algorithms fail to process them. Several MapReduce algorithms have recently been proposed to address such large networks; however, they suffer from massive shuffled data, resulting in very long processing times. In this paper, we propose PTE (Pre-partitioned Triangle Enumeration), a new distributed algorithm for enumerating triangles in enormous graphs that resolves the structural inefficiency of the previous MapReduce algorithms. PTE enumerates trillions of triangles in a billion-scale graph by decreasing three factors: the amount of shuffled data, the total work, and the network read. Experimental results show that PTE provides up to 47x faster performance than recent distributed algorithms on real world graphs, and succeeds in enumerating more than 3 trillion triangles on the ClueWeb12 graph with 6.3 billion vertices and 72 billion edges, which no previous triangle computation algorithm can process.

CCS Concepts: Information systems -> Data mining; Theory of computation -> Parallel algorithms; Distributed algorithms; MapReduce algorithms; Graph algorithms analysis.

Keywords: triangle enumeration; big data; graph algorithm; scalable algorithm; distributed algorithm; network analysis

KDD '16, August 13-17, 2016, San Francisco, CA, USA. (c) 2016 ACM. ISBN 978-1-4503-4232-2/16/08. DOI: http://dx.doi.org/10.1145/2939672.2939757

1. INTRODUCTION

How can we enumerate trillions of triangles from an enormous graph with billions of vertices and edges? The problem of triangle enumeration is to discover every triangle in a graph one by one, where a triangle is a set of three vertices connected to each other. The problem has numerous graph mining applications including detecting suspicious accounts like advertisers or fake users in social networks [29, 16], uncovering hidden thematic layers on the web [8], discovering roles [5], detecting web spam [3], finding communities [4, 24], etc.

A challenge in triangle enumeration is handling big real world networks, such as social networks and the WWW, which have millions or billions of vertices and edges. For example, Facebook and Twitter have 1.39 billion [9] and 300 million active users [27], respectively, and at least 1 trillion unique URLs are on the web [1]. Even recently proposed algorithms, however, fail to enumerate triangles from such large graphs. The algorithms follow different approaches: I/O efficient algorithms [13, 21, 19], distributed-memory algorithms [2, 11], and MapReduce algorithms [6, 22, 23, 26]. These algorithms have limited scalability. The I/O efficient algorithms use only a single machine, and thus they cannot process a graph exceeding the external memory space of the machine. The distributed-memory algorithms use multiple machines, but they also cannot process a graph whose intermediate data exceed the capacity of the distributed memory. The state-of-the-art MapReduce algorithm [23], named CTTP, significantly increases the size of processable datasets by dividing the entire task into several sub-tasks and processing them in separate MapReduce rounds. Even CTTP, however, takes a very long time to process an enormous graph because, in every round, CTTP reads the entire dataset and shuffles a large number of edges. Shuffling a large amount of data in a short time interval causes network congestion and heavy disk I/O, which decrease the scalability and the fault tolerance, and prolong the running time significantly.
Thus, it is desirable to shrink the amount of shuffled data. In this paper, we propose PTE (Pre-partitioned Triangle Enumeration), a new distributed algorithm for enumerating triangles in an enormous graph that resolves the structural inefficiency of the previous MapReduce algorithms. We show that PTE successfully enumerates trillions of triangles in a billion-scale graph by decreasing three factors: the amount of shuffled data, the total work, and the network read. The main contributions of this paper are summarized as follows:

- We propose PTE, a new distributed algorithm for enumerating triangles in an enormous graph, which is designed to minimize the amount of shuffled data, total work, and network read.

- We prove the efficiency of the proposed algorithm: the algorithm requires O(|E|) shuffled data, O(|E|^{3/2}/√M) network read, and O(|E|^{3/2}) total work, which is worst case optimal, where |E| is the number of edges of a graph and M is the available memory size of a machine.

- Our algorithm is experimentally evaluated using large real world networks. The results demonstrate that our algorithm outperforms the best previous distributed algorithms by up to 47x (see Figure 1). Moreover, PTE successfully enumerates more than 3 trillion triangles in the ClueWeb12 graph containing 6.3 billion vertices and 72 billion edges. All previous algorithms, including GraphLab, GraphX, and CTTP, fail to process this graph because of massive intermediate data.

Figure 1: The running time of the proposed methods (PTESC, PTECD, PTEBASE) and competitors (CTTP, MGT, GraphLab) on real world datasets (log scale). GraphX is not shown since it failed to process any of the datasets. Missing methods for some datasets mean they failed to run on those datasets. PTESC shows the best performance, outperforming CTTP and MGT by up to 47x and 17x, respectively. Only the proposed algorithms succeed in processing the ClueWeb12 graph containing 6.3 billion vertices and 72 billion edges.

The codes and datasets used in this paper are provided at http://datalab.snu.ac.kr/pte. The remaining part of the paper is organized as follows. In Section 2, we review previous research related to triangle enumeration. In Section 3, we formally define the problem and introduce important concepts and notations used in this paper. We introduce the details of our algorithm in Section 4.
The experimental results are given in Section 5. Finally, we conclude in Section 6. The symbols frequently used in this paper are summarized in Table 1.

Table 1: Table of symbols.

Symbol: Definition
G = (V, E): Simple graph with the set V of vertices and the set E of edges.
u, v, n: Vertices.
i, j, k: Vertex colors.
(u, v): Edge between u and v where u ≺ v.
(u, v, n): Triangle with vertices u, v, and n where u ≺ v ≺ n.
(i, j, k), (i, j): Subproblems.
d(u): Degree (number of neighbors) of u.
id(u): Vertex number of u, a unique identifier.
≺: Total order on V. u ≺ v means u precedes v.
ρ: Number of vertex colors.
ξ: Coloring function V -> {0, ..., ρ-1}. ξ(u) is the color of vertex u.
Eij: Set of edges (u, v) where (ξ(u), ξ(v)) = (i, j) or (j, i).
E*ij: Set of edges (u, v) where (ξ(u), ξ(v)) = (i, j).
M: Available memory size of a machine.
P: Number of processors in a distributed system.

2. RELATED WORK

In order to handle enormous graphs, several triangle enumeration algorithms have been proposed recently. In this section, we introduce the distinct approaches of these algorithms, including recent MapReduce algorithms related to our work. We also outline the MapReduce model and emphasize the importance of reducing the amount of shuffled data for improving performance.

2.1 I/O Efficient Triangle Algorithms

Recently, several triangle enumeration algorithms have been proposed in I/O efficient ways to handle graphs that do not fit into main memory [13, 21, 19]. Hu et al. [13] propose Massive Graph Triangulation (MGT), which buffers a certain number of edges in memory and finds all triangles containing one of these edges by traversing every vertex. Pagh and Silvestri [21] propose a cache-oblivious algorithm which colors the vertices of a graph hierarchically so that it does not need to know the cache structure of the system. Kim et al. [19] present a parallel external-memory algorithm exploiting the features of a solid-state drive (SSD). These algorithms, however, cannot process a graph exceeding the external memory space of a single machine. Moreover, these algorithms cannot output all the triangles of a large graph containing numerous triangles; for example, the ClueWeb12 graph (see Section 5) has 3 trillion triangles requiring 70 terabytes of storage.

2.2 Distributed-Memory Triangle Algorithms

The triangle enumeration problem has recently been targeted in the distributed-memory model, which assumes a multi-processor system where each processor has its own memory. We call a processor with a memory a machine. Arifuzzaman et al. [2] propose a distributed-memory algorithm based on the Message Passing Interface (MPI). The algorithm divides a graph into overlapping subgraphs and finds triangles in each subgraph in parallel. GraphLab-PowerGraph (shortly GraphLab) [11], an MPI-based distributed graph computation framework, provides an implementation of triangle enumeration. GraphLab copies each vertex and its outgoing edges γ times on average to multiple machines, where γ is determined by the characteristics of the input graph and the number of machines. Thus, γ|E| data are replicated in total, and GraphLab fails when γ|E|/P ≥ M, where P is the number of processors and M is the available memory size of a machine. GraphX, a graph computation library for Spark, provides an implementation of the same algorithm as in GraphLab; thus it has the same limitation in scalability. PDTL [10] is a parallel and distributed extension of MGT.
The experiments in that paper show the impressive speed of PDTL, but it has limited scalability: 1) every machine must hold a copy of the entire graph, 2) a part of PDTL runs on a single machine, which can be a performance bottleneck, and 3) it stores all triangles on a single machine. In summary, all the previous distributed-memory algorithms are limited in handling large graphs.

2.3 MapReduce

MapReduce [7] is a programming model supporting parallel and distributed computation to process large data. MapReduce is highly scalable and easy to use, and thus it has been used for various important graph mining and data mining tasks such as radius computation [18], graph queries [17], triangle counting [16, 22, 23], visualization [15], and tensors [14]. A MapReduce round transforms an input set of key-value pairs to an output set of key-value pairs in three steps: it first transforms each pair of the input to a set of new pairs (map step), groups the pairs by key so that all values with the same key are aggregated together (shuffle step), and processes the values of each key separately and outputs a new set of key-value pairs (reduce step). The amount of shuffled data significantly affects the performance of a MapReduce task because shuffling includes the heavy tasks of writing, sorting, and reading the data [12]. In detail, each map worker buffers the pairs from the map step in memory (collect). The buffered pairs are partitioned into R regions and written to local disk periodically, where R is the number of reduce workers (spill). Each reduce worker remotely reads the buffered data from the local disks of the map workers via the network (shuffle). When a reduce worker has read all the pairs for its partition, it sorts the pairs by key so that all values with the same key are grouped together (merge and sort). Because of such heavy I/O and network traffic, a large amount of shuffled data decreases the performance significantly. Thus, it is desirable to shrink the amount of shuffled data as much as possible.

2.4 MapReduce Triangle Algorithms

Several triangle computation algorithms have been designed in MapReduce. We review the algorithms in terms of the amount of shuffled data. The first MapReduce algorithm is proposed by Cohen [6]. The algorithm is a variant of node-iterator [25], a well-known sequential algorithm. It shuffles O(|E|^{3/2}) length-2 paths (also known as wedges) of a graph. Suri and Vassilvitskii [26] reduce the amount of shuffled data to O(|E|^{3/2}/√M) by proposing a graph partitioning based algorithm, Graph Partition (GP). Considering the types of triangles, Park and Chung [22] improve GP by a constant factor with their algorithm, Triangle Type Partition (TTP). That is, TTP also shuffles O(|E|^{3/2}/√M) data during the process. The aforementioned algorithms cause an out-of-space error when the size of the shuffled data is larger than the total available space. Park et al. [23] avoid the out-of-space error by introducing a multi-round algorithm, namely Colored TTP (CTTP). CTTP limits the shuffled data size of a round, and thus significantly increases the size of processable data. However, CTTP still shuffles the same amount of data as TTP does, that is, O(|E|^{3/2}/√M). Note that our proposed algorithm reduces the amount of shuffled data to O(|E|), improving the performance significantly.
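All of the algorithms above, as well as the partitioning step of our method (Section 4.1), follow the same round structure of map, shuffle, and reduce. The following minimal Python sketch simulates one such round in memory; the mapper, reducer, and edge coloring here are illustrative toys, not the implementation of any of the cited algorithms, and the intermediate dictionary stands in for the shuffled data whose size the discussion above tries to minimize.

    from collections import defaultdict

    def map_reduce_round(records, mapper, reducer):
        """Minimal in-memory model of one MapReduce round (map, shuffle, reduce)."""
        shuffled = defaultdict(list)
        for rec in records:                    # map step
            for key, value in mapper(rec):
                shuffled[key].append(value)    # everything collected here is "shuffled data"
        out = []
        for key, values in shuffled.items():   # reduce step, one call per key
            out.extend(reducer(key, values))
        return out

    # Hypothetical round: group edges by the color pair of their endpoints.
    rho = 3
    color = lambda v: v % rho
    edges = [(1, 2), (2, 5), (3, 4), (1, 7)]
    mapper = lambda e: [((color(e[0]), color(e[1])), e)]
    reducer = lambda key, values: [(key, values)]
    print(map_reduce_round(edges, mapper, reducer))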
3. PRELIMINARIES

In this section, we define the problem that we are going to solve and formally introduce several terms and notations. We also describe two previously introduced major algorithms for distributed triangle enumeration.

3.1 Problem Definition

We first define the problem of triangle enumeration.

Definition 1. (Triangle enumeration) Given a simple graph G = (V, E), the problem of triangle enumeration is to discover every triangle in the graph G one by one, where a simple graph is an undirected graph containing no loops and no duplicate edges, and a triangle is a set of three vertices fully connected to each other.

Note that we do not require the algorithm to retain or emit each triangle (u, v, n) into any memory system, but to call a local function enum(.) with the triangle as the parameter. On a vertex set V, we define a total order to uniquely express an edge or a triangle.

Definition 2. (Total order on V) The order of two vertices u and v is determined as follows: u ≺ v if d(u) < d(v), or if d(u) = d(v) and id(u) < id(v), where d(u) is the degree and id(u) is the unique identifier of a vertex u.

We denote by (u, v) an edge between two vertices u and v, and by (u, v, n) a triangle consisting of three vertices u, v, and n. Unless otherwise noted, the vertices of an edge (u, v) are in the order u ≺ v, and we presume the edge has a direction from u to v even though the graph is undirected. Similarly, the vertices of a triangle (u, v, n) are in the order u ≺ v ≺ n, and we give each edge of a triangle a name to simplify the description as follows (see Figure 2):

Definition 3. For a triangle (u, v, n) where the vertices are in the order u ≺ v ≺ n, we call (u, v) the pivot edge, (u, n) the port edge, and (v, n) the starboard edge.

Figure 2: A triangle with edge directions given by the total order on the three vertices: the pivot edge (u, v), the port edge (u, n), and the starboard edge (v, n).
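The total order of Definition 2 is what lets PTE treat each undirected edge as a directed one. A minimal Python sketch of the comparison (the degree and id lookups are hypothetical dictionaries, not part of the paper's code) looks as follows:

    # Hypothetical lookups: degree[v] = d(v), vid[v] = id(v).
    degree = {"a": 2, "b": 3, "c": 2}
    vid = {"a": 0, "b": 1, "c": 2}

    def precedes(u, v):
        """Return True if u ≺ v under the degree-then-id total order of Definition 2."""
        return (degree[u], vid[u]) < (degree[v], vid[v])

    def orient(u, v):
        """Store an undirected edge {u, v} as the ordered pair (u, v) with u ≺ v."""
        return (u, v) if precedes(u, v) else (v, u)

    print(orient("b", "a"))  # ('a', 'b'): 'a' has the smaller degree, so it precedes 'b'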
3.2 Triangle Enumeration in TTP and CTTP

Park and Chung [22] propose a MapReduce algorithm, named Triangle Type Partition (TTP), for enumerating triangles. We briefly introduce TTP because of its relevance to our work, and we show that it shuffles O(|E|^{3/2}/√M) data. The first step of TTP is to color the vertices with ρ = O(√(|E|/M)) colors by a hash function ξ : V -> {0, ..., ρ-1}. Let Eij, with i, j in {0, ..., ρ-1} and i ≤ j, be the set {(u, v) in E | i = min(ξ(u), ξ(v)) and j = max(ξ(u), ξ(v))}. A triangle is classified as type-1 if all the vertices of the triangle have the same color, type-2 if exactly two vertices have the same color, and type-3 if no two vertices have the same color. TTP divides the entire problem into (ρ choose 2) + (ρ choose 3) subproblems of two types:

(i, j) subproblem, with i, j in {0, ..., ρ-1} and i < j, enumerates the triangles in the edge-induced subgraph on E'ij = Eij ∪ Eii ∪ Ejj. It finds every triangle of type-1 and type-2 whose vertices are colored with i and j. There are (ρ choose 2) subproblems of this type.

(i, j, k) subproblem, with i, j, k in {0, ..., ρ-1} and i < j < k, enumerates the triangles in the edge-induced subgraph on E'ijk = Eij ∪ Eik ∪ Ejk. It finds every triangle of type-3 whose vertices are colored with i, j, and k. There are (ρ choose 3) subproblems of this type.

Each map task of TTP gets an edge e in E and emits ⟨(i, j); e⟩ and ⟨(i, j, k); e⟩ for every E'ij and E'ijk containing e, respectively. Thus, each reduce task gets a pair ⟨(i, j); E'ij⟩ or ⟨(i, j, k); E'ijk⟩, and finds all triangles in the corresponding edge-induced subgraph. For each edge, a map task emits ρ - 1 key-value pairs (Lemma 2 in [22]); that is, O(|E|ρ) = O(|E|^{3/2}/√M) pairs are shuffled in total. TTP fails to process a graph when the shuffled data size is larger than the total available space. CTTP [23] avoids the failure by dividing the tasks into multiple rounds and limiting the shuffled data size of a round. However, CTTP still shuffles exactly the same pairs as TTP; hence CTTP also suffers from the massive shuffled data, resulting in a very long running time.
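The ρ - 1 pairs per edge can be seen directly by listing the subproblem keys an edge belongs to. The following Python sketch is a toy illustration of this counting, not TTP's actual implementation; the coloring function and vertex names are made up.

    # Toy illustration of TTP's key generation for a single edge (u, v),
    # assuming rho colors and a coloring function xi.
    rho = 5
    def xi(v):                      # hypothetical hash coloring
        return hash(v) % rho

    def ttp_keys(u, v):
        i, j = sorted((xi(u), xi(v)))
        if i == j:                  # same color: the edge joins every (i, k) subproblem
            return [tuple(sorted((i, k))) for k in range(rho) if k != i]
        keys = [(i, j)]             # two colors: one (i, j) subproblem ...
        keys += [tuple(sorted((i, j, k))) for k in range(rho) if k not in (i, j)]
        return keys                 # ... plus rho - 2 subproblems (i, j, k)

    print(len(ttp_keys("u1", "v7")))  # always rho - 1 = 4 keys, i.e., rho - 1 shuffled pairs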
4. PROPOSED METHOD

In this section, we propose PTE (Pre-partitioned Triangle Enumeration), a distributed algorithm for enumerating triangles in an enormous graph. There are several challenges in designing an efficient and scalable distributed algorithm for triangle enumeration.

1. Minimize shuffled data. Massive data are shuffled by the previous algorithms to generate subgraphs. How can we minimize the amount of shuffled data?

2. Minimize computations. The previous algorithms contain several kinds of redundant operations (details in Section 4.2). How can we remove the redundancy?

3. Minimize network read. In previous algorithms, each subproblem reads the necessary sets of edges via the network, and the amount of network read is determined by the number of vertex colors. How can we decrease the number of vertex colors to minimize network read?

We have the following main ideas to address the above challenges, which are described in detail in later subsections.

1. Separating graph partitioning from generating subgraphs decreases the amount of shuffled data to O(|E|) from the O(|E|^{3/2}/√M) of the previous MapReduce algorithms (Section 4.1).

2. Considering the color-direction of edges removes redundant operations and minimizes computations (Section 4.2).

3. Carefully scheduling the triangle computations in subproblems shrinks the amount of network read by decreasing the number of vertex colors (Section 4.3).

In the following, we first describe PTEBASE, which exploits pre-partitioning to decrease the shuffled data (Section 4.1). Then, we describe PTECD, which improves on PTEBASE by removing redundant operations (Section 4.2). After that, we explain our final method PTESC, which further improves on PTECD by shrinking the amount of network read (Section 4.3). The theoretical analysis of the methods and implementation issues are discussed at the end (Sections 4.4 and 4.5). Note that although we describe PTE using MapReduce primitives for simplicity, PTE is general enough to be implemented on any distributed framework (discussions in Section 4.5 and experimental comparisons in Section 5.2.3).

4.1 PTEBASE: Pre-partitioned Triangle Enumeration

In this section we propose PTEBASE, which rectifies the massive shuffled data problem of the previous MapReduce algorithms. The main idea is partitioning an input graph into sets of edges before enumerating triangles, and storing the sets in a distributed storage like the Hadoop Distributed File System (HDFS) of Hadoop, or a Resilient Distributed Dataset (RDD) of Spark. We observe that the subproblems of TTP require each set Eij of edges as a unit. This implies that if each edge set Eij is directly accessible from a distributed storage, we need not shuffle the edges as TTP does. Consequently, we partition the input graph into ρ + (ρ choose 2) = ρ(ρ+1)/2 sets of edges according to the vertex colors of each edge; the ρ and (ρ choose 2) sets correspond to Eij with i = j and i < j, respectively. Each edge (u, v) in Eij keeps the order u ≺ v. Each vertex is colored by a coloring function ξ which is randomly chosen from a pairwise independent family of functions [28]. The pairwise independence of ξ guarantees that the edges are evenly distributed. PTEBASE sets ρ to ⌈√(6|E|/M)⌉ to fit the three edge sets of an (i, j, k) subproblem into the memory of size M of a processor: the expected size of an edge set is 2|E|/ρ^2, and the sum 6|E|/ρ^2 of the sizes of the three edge sets should be less than or equal to the memory size M.

After the graph partitioning, PTEBASE reads edge sets and finds triangles in each subproblem. In each (i, j) subproblem, it reads Eij, Eii, and Ejj, and enumerates triangles in the union of the edge sets. In each (i, j, k) subproblem, similarly, it reads Eij, Eik, and Ejk, and enumerates triangles in the union of the edge sets. The edge sets are read from a distributed storage via the network, and the total amount of network read is O(|E|ρ) (see Section 3.2). Note that the network read is different from the data shuffle; the data shuffle is a much heavier task since it requires data collecting and writing at the senders, data transfer via the network, and data merging and sorting at the receivers (see Section 2.3). The network read, in contrast, involves data transfer via the network only.
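Before the formal pseudocode in Algorithms 1 and 2 below, a minimal single-machine sketch of the partitioning step may help. The file layout and the simple modular coloring here are illustrative assumptions, not the exact implementation; the paper's coloring is drawn from a pairwise independent hash family [28].

    import os
    from collections import defaultdict

    rho = 4                                   # assumed number of colors
    def xi(v):                                # illustrative coloring function
        return v % rho

    def partition_edges(edges, out_dir="pte_parts"):
        """Bucket each ordered edge (u, v), u ≺ v, into E*_{xi(u), xi(v)} and
        write every bucket to its own file, mimicking one file per edge set in HDFS."""
        buckets = defaultdict(list)
        for u, v in edges:
            buckets[(xi(u), xi(v))].append((u, v))
        os.makedirs(out_dir, exist_ok=True)
        for (i, j), bucket in buckets.items():
            with open(os.path.join(out_dir, f"E_{i}_{j}.txt"), "w") as f:
                f.writelines(f"{u} {v}\n" for u, v in bucket)
        return buckets

    # Later, an (i, j) or (i, j, k) subproblem only opens the files whose colors it needs.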
Algorithm 1: Graph Partitioning
/* ψ is a meaningless dummy key */
1 Map: input ⟨ψ; (u, v) in E⟩
    emit ⟨(ξ(u), ξ(v)); (u, v)⟩
2 Reduce: input ⟨(i, j); E*ij⟩
    emit E*ij to a distributed storage

Algorithm 2: Triangle Enumeration (PTEBASE)
/* ψ is a meaningless dummy key */
1 Map: input ⟨ψ; problem = (i, j) or (i, j, k)⟩; initialize E'
2 if problem is of type (i, j) then
3     read Eij, Eii, Ejj            /* Eij = E*ij ∪ E*ji, and Eii = E*ii */
4     E' <- Eij ∪ Eii ∪ Ejj
5 else if problem is of type (i, j, k) then
6     read Eij, Eik, Ejk            /* Eij = E*ij ∪ E*ji, Eik = E*ik ∪ E*ki, and Ejk = E*jk ∪ E*kj */
7     E' <- Eij ∪ Eik ∪ Ejk
8 enumerateTriangles(E')            /* enumerate triangles in the edge-induced subgraph on E' */
9 Function enumerateTriangles(E)
10    foreach (u, v) in E do
11        foreach n in {nu | (u, nu) in E} ∩ {nv | (v, nv) in E} do
12            if ξ(u) = ξ(v) = ξ(n) then
13                if (ξ(u) = i and i + 1 ≡ j mod ρ) or (ξ(u) = j and j + 1 ≡ i mod ρ) then
14                    enum((u, v, n))
15            else
16                enum((u, v, n))

Pseudocode for PTEBASE is listed in Algorithm 1 and Algorithm 2. The graph partitioning is done by a pair of map and reduce steps (Algorithm 1). In the map step, PTEBASE transforms each edge (u, v) into a pair ⟨(ξ(u), ξ(v)); (u, v)⟩ (line 1). The edges of the pairs are aggregated by the keys, and for each key (i, j), a reduce task receives E*ij and emits it to a separate file in a distributed storage (line 2), where E*ij is {(u, v) in E | (ξ(u), ξ(v)) = (i, j)}. Note that the union of E*ij and E*ji is Eij, that is, E*ij ∪ E*ji = Eij.

Thanks to the pre-partitioned edge sets, the triangle enumeration is done by a single map step (see Algorithm 2). Each map task reads the edge sets needed to solve a subproblem (i, j) or (i, j, k) (lines 3, 6), makes the union of the edge sets (lines 4, 7), and enumerates triangles with a sequential algorithm enumerateTriangles (line 8). Although any sequential algorithm for triangle enumeration can be used for enumerateTriangles, we use CompactForward [20], one of the best sequential algorithms, with a modification (lines 9-16): we skip the vertex sorting procedure of CompactForward because the edges are already ordered by the degrees of their vertices. Given a set E' of edges, enumerateTriangles runs in O(|E'|^{3/2}) total work, the same as that of CompactForward. Note that we propose a specialized algorithm to reduce the total work in Section 4.2. Note also that although every type-1 triangle appears ρ - 1 times, PTEBASE emits the triangle only once: for each type-1 triangle of color i, PTEBASE emits the triangle if and only if i + 1 ≡ j mod ρ given a subproblem (i, j) or (j, i) (lines 12-14). Still, a type-1 triangle is computed ρ - 1 times, which is unnecessary. We completely resolve this issue in Section 4.2.

Figure 3: (a) An example of finding type-3 triangles containing an edge (u, v) in Eij in an (i, j, k) subproblem. PTEBASE finds the triangle (u, v, b) by intersecting u's neighbor set {v, a, b} and v's neighbor set {n, b}. However, it is unnecessary to consider the edges in Eij since the other two edges of a type-3 triangle containing (u, v) must be in Eik and Ejk, respectively. (b) enumerateTrianglesCD(E*ij, E*ik, E*jk) in PTECD finds every triangle whose pivot edge, port edge, and starboard edge have the same color-directions as those of E*ij, E*ik, and E*jk, respectively. The arrows denote the color-directions.

4.2 PTECD: Reducing the Total Work

PTECD improves on PTEBASE to minimize the amount of computation by exploiting color-direction. We first give an example (Figure 3(a)) to show that the function enumerateTriangles in Algorithm 2 performs redundant operations. Consider finding the type-3 triangles containing an edge (u, v) in Eij in an (i, j, k) subproblem. enumerateTriangles finds such triangles by intersecting the two outgoing neighbor sets of u and v. In the figure, the neighbor sets are {v, a, b} and {n, b}, and we find the triangle (u, v, b). However, it is unnecessary to consider the edges in Eij (that is, (u, v), (u, a), (v, n)) since the other two edges of a type-3 triangle containing (u, v) must be in Eik and Ejk, respectively. The redundant operations can be removed by intersecting u's neighbors only in Eik and v's neighbors only in Ejk, instead of looking at all the neighbors of u and v. In the figure, the two neighbor sets are then both {b}, and we find the same triangle (u, v, b).
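To make the saving concrete, the following toy Python sketch mirrors the example of Figure 3(a) with made-up adjacency sets; it is an in-memory rendering of the idea behind the color-directed intersection, not the paper's implementation.

    # Toy adjacency sets for the example of Figure 3(a):
    # outgoing neighbors of u and v, split by the edge set they fall in.
    u_nbrs_Eij = {"v", "a"}          # edges (u, v), (u, a) in Eij
    u_nbrs_Eik = {"b"}               # edge (u, b) in Eik
    v_nbrs_Eij = {"n"}               # edge (v, n) in Eij
    v_nbrs_Ejk = {"b"}               # edge (v, b) in Ejk

    # PTEBASE-style: intersect all outgoing neighbors of u and v.
    base = (u_nbrs_Eij | u_nbrs_Eik) & (v_nbrs_Eij | v_nbrs_Ejk)

    # PTECD-style: only the port-edge side (Eik) and the starboard-edge side (Ejk)
    # can close a type-3 triangle whose pivot edge lies in Eij.
    cd = u_nbrs_Eik & v_nbrs_Ejk

    print(base, cd)                  # both print {'b'}, but cd touches far fewer elements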
PTECD removes the redundant operations by adopting a new function enumerateTrianglesCD (lines 11-14 in Algorithm 3). We define the color-direction of an edge (u, v) to be from ξ(u) to ξ(v), and we also define the color-direction of E*ij to be from i to j. Then, enumerateTrianglesCD(E*ij, E*ik, E*jk) finds every triangle whose pivot edge, port edge, and starboard edge have the same color-directions as those of E*ij, E*ik, and E*jk, respectively (see Figure 3(b)). Note that, by separating the input edge sets, the algorithm does not look at any edges in Eij for the intersection in the example of Figure 3(a).

Redundant operations of another type appear in (i, j) subproblems; PTEBASE outputs a type-1 triangle exactly once but still computes it multiple times: type-1 triangles with a color i appear ρ - 1 times, in the (i, j) or (j, i) subproblems for j in {0, ..., ρ-1} \ {i}. PTECD resolves the duplicate computation by performing enumerateTrianglesCD(E*ii, E*ii, E*ii) exactly once for each vertex color i; thus, every type-1 triangle is computed only once.

Algorithm 3: Triangle Enumeration (PTECD)
/* ψ is a meaningless dummy key */
1 Map: input ⟨ψ; problem = (i, j) or (i, j, k)⟩
2 if problem is of type (i, j) then
3     read E*ij, E*ji, E*ii, E*jj
      /* enumerate type-1 triangles */
4     if i + 1 = j then
5         enumerateTrianglesCD(E*ii, E*ii, E*ii)
6     else if j = ρ - 1 and i = 0 then enumerateTrianglesCD(E*jj, E*jj, E*jj)
      /* enumerate type-2 triangles */
7     foreach (x, y, z) in {(i, i, j), (i, j, i), (j, i, i), (i, j, j), (j, i, j), (j, j, i)} do
8         enumerateTrianglesCD(E*xy, E*xz, E*yz)
9 else if problem is of type (i, j, k) then
10    enumerateType3Triangles((i, j, k))
      /* enumerate every triangle (u, v, n) such that ξ(u) = i, ξ(v) = j and ξ(n) = k */
11 Function enumerateTrianglesCD(E*ij, E*ik, E*jk)
12    foreach (u, v) in E*ij do
13        foreach n in {nu | (u, nu) in E*ik} ∩ {nv | (v, nv) in E*jk} do
14            enum((u, v, n))
15 Function enumerateType3Triangles(i, j, k)
16    read E*ij, E*ji, E*ik, E*jk, E*ki, E*kj
17    foreach (x, y, z) in {(i, j, k), (i, k, j), (j, i, k), (j, k, i), (k, i, j), (k, j, i)} do
18        enumerateTrianglesCD(E*xy, E*xz, E*yz)

Algorithm 3 shows PTECD using the new function enumerateTrianglesCD. To find the type-3 triangles in each (i, j, k) subproblem, PTECD calls the function enumerateTrianglesCD 6 times, once for every possible color-direction (lines 10, 15-18) (see Figure 4(a)). To find the type-2 triangles in each (i, j) subproblem, similarly, PTECD calls enumerateTrianglesCD 6 times (lines 7-8) (see Figure 4(b)). For the type-1 triangles whose vertices have a color i, PTECD performs enumerateTrianglesCD(E*ii, E*ii, E*ii) only if i + 1 ≡ j mod ρ given a subproblem (i, j) or (j, i), so that enumerateTrianglesCD(E*ii, E*ii, E*ii) operates exactly once (lines 3-6). As a result, the algorithm emits every triangle exactly once.

Figure 4: The color-directions of a triangle according to its type. (a) The six color-directions of a type-3 triangle in an (i, j, k) subproblem: i≺j≺k, j≺i≺k, k≺i≺j, i≺k≺j, j≺k≺i, and k≺j≺i. (b) The six color-directions of a type-2 triangle in an (i, j) subproblem: i≺i≺j, i≺j≺i, j≺i≺i, j≺j≺i, j≺i≺j, and i≺j≺j. The function enumerateTrianglesCD is called once for each color-direction; that is, PTECD calls it 6 times for type-3 triangles and 6 times for type-2 triangles.

By removing the two types of redundant operations, PTECD decreases the number of operations for intersecting neighbor sets by more than 2 - 2/ρ times compared to PTEBASE in expectation. As we will see in Section 5.2.1, PTECD decreases the number of operations by up to 6.83x compared to PTEBASE on real world graphs.

Theorem 1. PTECD decreases the number of operations for intersecting neighbor sets by more than 2 - 2/ρ times compared to PTEBASE in expectation.

Proof. To intersect the sets of neighbors of u and v for an edge (u, v) such that ξ(u) = i, ξ(v) = j and i ≠ j, the function enumerateTriangles in PTEBASE performs d*ij(u) + d*ik(u) + d*ji(v) + d*jk(v) operations while enumerateTrianglesCD in PTECD performs d*ik(u) + d*jk(v) operations for each color k in {0, ..., ρ-1} \ {i, j}, where d*ij(u) is the number of u's neighbors in E*ij. Thus, PTEBASE performs (ρ - 2) x (d*ξ(u)ξ(v)(u) + d*ξ(v)ξ(u)(v)) additional operations compared to PTECD for each (u, v) in E_out, where E_out is the set of edges (u, v) in E such that ξ(u) ≠ ξ(v); that is,

(ρ - 2) x Σ_{(u,v) in E_out} (d*ξ(u)ξ(v)(u) + d*ξ(v)ξ(u)(v))   (1)

Given an edge (u, v) such that ξ(u) = ξ(v) = i, PTEBASE performs d*ii(u) + d*ij(u) + d*ii(v) + d*ij(v) operations for each color j in {0, ..., ρ-1} \ {i}; meanwhile, PTECD performs d*ij(u) + d*ij(v) operations for each color j in {0, ..., ρ-1}. Thus, PTEBASE performs (ρ - 2) x (d*ξ(u)ξ(v)(u) + d*ξ(v)ξ(u)(v)) more operations than PTECD for each (u, v) in E_in, where E_in is E \ E_out; that is,

(ρ - 2) x Σ_{(u,v) in E_in} (d*ξ(u)ξ(v)(u) + d*ξ(v)ξ(u)(v))   (2)

Then, the total number of additional operations performed by PTEBASE, compared to PTECD, is the sum of (1) and (2):

(ρ - 2) x Σ_{(u,v) in E} (d*ξ(u)ξ(v)(u) + d*ξ(v)ξ(u)(v))   (3)

The expected value of d*ξ(u)ξ(v)(u) is d*(u)/ρ, where d*(u) is the number of neighbors v of u such that u ≺ v, since the coloring function ξ is randomly chosen from a pairwise independent family of functions. Thus, (3) becomes

((ρ - 2)/ρ) x Σ_{(u,v) in E} (d*(u) + d*(v)) + O(ρ)   (4)

We add O(ρ) because (d*(u) + d*(v))/ρ can be less than 1 while d*ξ(u)ξ(v)(u) + d*ξ(v)ξ(u)(v) is always larger than or equal to 1. Meanwhile, the number of operations performed by PTECD for intersecting neighbor sets is Σ_{(u,v) in E} (d*(u) + d*(v)), as we will see in Theorem 5. Thus, PTECD reduces the number of operations by more than 2 - 2/ρ times compared to PTEBASE in expectation.
4.3 PTESC: Reducing the Network Read

PTESC further improves on PTECD to shrink the amount of network read by scheduling the calls of the function enumerateTrianglesCD. Reading each E*ij in ρ - 1 subproblems, PTECD (as well as PTEBASE) reads O(|E|ρ) data via the network in total. For example, E*01 is read in every (0, 1, k) subproblem for 2 ≤ k < ρ, and in the (0, 1) subproblem. This implies that the amount of network read depends on ρ, the number of vertex colors. PTEBASE and PTECD set ρ to ⌈√(6|E|/M)⌉ as mentioned in Section 4.1. In PTESC, we reduce it to ⌈√(5|E|/M)⌉ by setting the sequence of triangle computation as in Figure 5, which represents the schedule of data loading for an (i, j, k) subproblem. We denote by Δijk the set of triangles enumerated by enumerateTrianglesCD(E*ij, E*ik, E*jk). PTESC handles the triangle sets one by one from left to right in the figure. The check-marks show which edge sets each triangle set requires. For example, E*ij, E*ik, and E*jk should be retained in the memory together to enumerate Δijk. When we restrict each edge set to be read only once in a subproblem, the shaded areas represent when the edge sets are in the memory. For example, E*ij is read before Δijk and is released after Δkij. Then, we can easily see that the maximum number of edge sets retained in the memory together at a time is 5, and this leads to setting ρ to ⌈√(5|E|/M)⌉.

Figure 5: The schedule of data loading for an (i, j, k) subproblem. PTESC enumerates the triangle sets in the columns one by one from left to right. Each check-mark (v) shows which edge sets a triangle set requires. Each shaded area denotes when the edge set is in the memory, when we restrict each edge set to be read only once in a subproblem.

          Δijk  Δikj  Δjik  Δjki  Δkij  Δkji
E*ij       v     v                 v
E*ik       v     v     v
E*ji                   v     v           v
E*jk       v           v     v
E*ki                         v     v     v
E*kj             v                 v     v

The procedure of type-3 triangle enumeration with this scheduling method is described in Algorithm 4, which replaces the function enumerateType3Triangles in Algorithm 3.

Algorithm 4: Type-3 triangle enumeration in PTESC
1 Function enumerateType3Triangles(i, j, k)
2     read E*ij, E*ji, E*ik, E*jk, E*kj
3     foreach (x, y, z) in {(i, j, k), (i, k, j), (j, i, k)} do
4         enumerateTrianglesCD(E*xy, E*xz, E*yz)
5     release E*ik
6     read E*ki
7     foreach (x, y, z) in {(j, k, i), (k, i, j), (k, j, i)} do
8         enumerateTrianglesCD(E*xy, E*xz, E*yz)
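The claim that at most five edge sets are ever resident can be checked mechanically. The following Python sketch replays the schedule of Figure 5 / Algorithm 4 under the read-once restriction and reports the peak number of resident edge sets; the symbolic color names are placeholders, and the sketch is an illustration rather than part of the algorithm.

    # Each triangle set Δxyz needs the pivot set E*xy, the port set E*xz,
    # and the starboard set E*yz (Section 4.2).
    def needs(x, y, z):
        return {(x, y), (x, z), (y, z)}

    i, j, k = "i", "j", "k"
    schedule = [(i, j, k), (i, k, j), (j, i, k), (j, k, i), (k, i, j), (k, j, i)]

    resident, peak = set(), 0
    for step, (x, y, z) in enumerate(schedule):
        resident |= needs(x, y, z)              # read the sets not yet in memory
        peak = max(peak, len(resident))
        still_needed = set().union(*[needs(*t) for t in schedule[step + 1:]])
        resident &= still_needed                # release sets never needed again
    print(peak)                                 # prints 5, matching the bound used to set rho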
) enumerateTrianglesCD(Exy xz yz cedure of type-3 triangle enumeration with the scheduling method is described in Algorithm 4 which replaces the function enumerateType3Triangles in Algorithm 3. Note that the number 5 of edge sets loaded in the memory at a time is optimal as shown in the following theorem. Theorem 2. Given an (i, j, k) subproblem, the maximum number of edge sets retained in the memory at a time cannot be smaller than 5, if each edge set can be read only once. Proof. Suppose that there is a schedule to make the maximum number of edge sets retained in the memory at a time less than 5. Then, we first read 4 edge sets in the memory. Then, 1) any edge set in the memory cannot be released until all triangles containing an edge in the set have been enumerated, and 2) we cannot read another edge set until we release one in the memory. Thus, for at least one edge set in the memory, it should be able to process all triangles containing an edge in the edge set without reading an additional edge set. However, it is impossible because enumerating all triangles containing an edge in an edge set requires 5 edge sets but we have only 4 edge sets. For example, the triangles in ∆ijk , ∆ikj , and ∆kij , which are related ? ? ? ? ? ? , Eik , Ejk , Ekj , and Eki . , require Eij to an edge set Eij Thus, there is no schedule to make the maximum number of edge sets retained in the memory at a time less than 5. 4.4 Analysis In this section we analyze the proposed algorithm in terms of the amount of shuffled data, network read, and total work. We first prove the claimed amount of shuffled data generated by the graph partitioning in Algorithm 1. Theorem 3. The amount of shuffled data for partitioning a graph is O(|E|) where |E| is the number of edges in the graph. Proof. The pairs emitted from the map operation is exactly the data to be shuffled. For each edge, a map task emits one pair; accordingly, the amount of shuffled data is the number |E| of edges in the graph. We emphasize that while √ the previous MapReduce algorithms shuffle O(|E|3/2 / M ) data, we reduce it to be O(|E|). Instead of data shuffle requiring heavy disk I/O, network read, and massive intermediate data, we only √ require the same amount of network read, that is O(|E|3/2 / M ). √ Theorem 4. PTE requires O(|E|3/2 / M ) network read. ? Proof. We first show that every Eij for (i, j) ∈ 2 ? {0, · · · , ρ − 1} are read ρ − 1 times. It is clear that Eij such that i = j is read ρ − 1 times in (i, k) or (k, i) subprob? lems for k ∈ {0, · · · , ρ − 1} \ {i}. We now consider Eij for i 6= j. Without loss of generality, we assume i < j. Then, ? Eij is read ρ − 2 times in (i, j, k) or (i, k, j) or (k, i, j) subproblems where k ∈ {0, · · · , ρ − 1} \ {i, j}, and once in an (i, j) subproblem; ρ − 1 times in total. The total amount of data read by PTE is as follows: (ρ−1) ρ−1 X ρ−1 X s ? |Eij | = |E|(ρ−1) = |E| i=0 j=0 5|E| − 1 = O M |E|3/2 √ M ! ? ? where |Eij | is the number of edges in Eij . Finally, we prove the claimed total work of the proposed algorithm. Theorem 5. PTE requires O(|E|3/2 ) total work. Proof. Intersecting two sets requires comparisons as many times as the number of elements in the two sets. Accordingly, the number of operations performed by enumerate? ? ? TrianglesCD(Eij , Eik , Ejk ) is X ? ? dik (u) + djk (v) + O(1) (u,v)∈E ? ij ? where d?ik (u) is the number of u’s neighbors in Eik . We put ? ? O(1) since dik (u) + djk (v) can be smaller than 1. 
PTESC (as well as PTECD) calls enumerateTrianglesCD for every possible triple (i, j, k) in {0, ..., ρ-1}^3; thus the total number of operations is as follows:

Σ_{i=0}^{ρ-1} Σ_{j=0}^{ρ-1} Σ_{k=0}^{ρ-1} Σ_{(u,v) in E*ij} (d*ik(u) + d*jk(v) + O(1)) = O(|E|ρ) + Σ_{(u,v) in E} (d*(u) + d*(v))

The left term O(|E|ρ) is O(|E|^{3/2}/√M), accounting for checking all edges in each subproblem; this term also occurs in PTEBASE. The right summation is the number of operations for intersecting neighbors, and it is O(|E|^{3/2}) when the vertices are ordered by Definition 2, because the maximum value of d*(u) over all u in V is 2√|E|, as proved in [25]. Note that this is worst case optimal and the same as that of one of the best sequential algorithms [20].

4.5 Implementation

In this section, we discuss practical implementation issues of PTE. We focus on the most popular distributed computation frameworks, Hadoop and Spark. Note that PTE can be implemented on any distributed framework which supports map and reduce functionalities.

PTE on Hadoop. We describe how to implement PTE on Hadoop, which is the de facto standard MapReduce framework. The graph partitioning method (Algorithm 1) of PTE is implemented as a single MapReduce round. The result of the graph partitioning method has a custom output format that stores each edge set as a separate file in the Hadoop Distributed File System (HDFS); thus each edge set is accessible by its path. The triangle enumeration method (Algorithm 2 or 3) of PTE is implemented as a single map step where each map task processes an (i, j) or (i, j, k) subproblem. For this purpose, we generate a text file where each line is (i, j) or (i, j, k), and make each map task read a line and solve the corresponding subproblem.

PTE on Spark. We describe how to implement PTE on Spark, which is another popular distributed computing framework. The reduce operation of Algorithm 1 is replaced by a pair of partitionBy and mapPartitions operations on a general Resilient Distributed Dataset (RDD). The partitionBy operation uses a custom partitioner that partitions edges according to their vertex colors. The mapPartitions operation outputs an RDD (namely edgeRDD) where each partition contains an edge set. A special RDD, where each partition is in charge of a subproblem (i, j) or (i, j, k), wraps the edgeRDD; the edge sets used in each partition of the special RDD are loaded directly from the edgeRDD.
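A rough PySpark rendering of this partitioning step, assuming an existing RDD `edges` of (u, v) pairs with u ≺ v and a simple modular coloring (both the coloring and the partition layout are illustrative assumptions, not the paper's exact code), could look like this:

    rho = 4                                     # assumed number of colors

    def color(v):                               # stand-in for the coloring function xi
        return v % rho

    def to_partition(key):                      # custom partitioner: one partition per (i, j)
        i, j = key
        return i * rho + j

    edge_rdd = (edges                           # RDD of (u, v) pairs, u ≺ v
        .map(lambda e: ((color(e[0]), color(e[1])), e))
        .partitionBy(rho * rho, to_partition)   # each edge is shuffled exactly once
        .mapPartitions(lambda part: [[e for _, e in part]]))  # one list per edge set E*_ij

    # Each later subproblem (i, j) or (i, j, k) then pulls only the partitions whose
    # color pairs it needs, instead of re-shuffling the whole edge list.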
5. EXPERIMENTS

In this section, we experimentally evaluate our algorithms and compare them to recent single machine and distributed algorithms. We aim to answer the following questions.

Q1 How much do the three methods of PTE contribute to the performance improvement? (Section 5.2.1)

Q2 How does PTE scale up in terms of the number of machines? (Section 5.2.2)

Q3 How does the performance of PTE change depending on the underlying distributed framework (MapReduce or Spark)? (Section 5.2.3)

We first introduce the datasets and experimental environment in Section 5.1. After that, we answer the questions in Section 5.2, presenting the results of the experiments.

5.1 Setup

5.1.1 Datasets

We use real world datasets to evaluate the proposed algorithms. The datasets are summarized in Table 2. Twitter is a followee-follower network of the social network service Twitter. SubDomain is a hyperlink network among domains where an edge exists if there is at least one hyperlink between two subdomains. YahooWeb, ClueWeb09, and ClueWeb12 are page-level hyperlink networks on the Web. ClueWeb12/k is a subgraph of ClueWeb12; we make ClueWeb12/k by random edge sampling with sampling probability 1/k, where k varies from 2 to 256. Each dataset is preprocessed to be a simple graph. We reorder the vertices of each edge (u, v) to satisfy u ≺ v using an algorithm in [6]. These tasks are done in O(|E|).

Table 2: The summary of datasets.

Dataset | Vertices | Edges | Triangles
Twitter (TWT)¹ | 42M | 1.2B | 34,824,916,864
SubDomain (SD)² | 101M | 1.9B | 417,761,664,336
YahooWeb (YW)³ | 1.4B | 6.4B | 85,782,928,684
ClueWeb09 (CW09)⁴ | 4.8B | 7.9B | 31,012,381,940
ClueWeb12 (CW12)⁵ | 6.3B | 72B | 3,064,499,157,291
ClueWeb12/256 | 287M | 260M | 184,494
ClueWeb12/128 | 457M | 520M | 1,458,207
ClueWeb12/64 | 679M | 1.0B | 11,793,453
ClueWeb12/32 | 947M | 2.1B | 94,016,252
ClueWeb12/16 | 1.3B | 4.2B | 750,724,625
ClueWeb12/8 | 1.8B | 8.2B | 6,015,772,801
ClueWeb12/4 | 2.6B | 17B | 48,040,237,257
ClueWeb12/2 | 4.0B | 35B | 384,568,217,900

¹ http://an.kaist.ac.kr/traces/WWW2010.html
² http://webdatacommons.org/hyperlinkgraph
³ http://webscope.sandbox.yahoo.com
⁴ http://boston.lti.cs.cmu.edu/clueweb09/wiki/tiki-index.php?page=Web+Graph
⁵ http://www.lemurproject.org/clueweb12/webgraph.php

5.1.2 Experimental Environment

We implement PTE on Hadoop (the open source version of MapReduce) and Spark. The results described in Sections 5.2.1 and 5.2.2 are from the Hadoop implementation of PTE; we describe the Spark results in Section 5.2.3. We compare PTE with previous algorithms: CTTP [23], MGT [13], and the triangle counting implementations on GraphLab and GraphX. CTTP is the state-of-the-art MapReduce algorithm. MGT is an I/O efficient external memory algorithm. GraphX is a graph processing API on Spark, a distributed computing framework. GraphLab is another distributed graph processing framework, based on MPI. All experiments were conducted on a cluster with 41 machines, where each machine is equipped with an Intel Xeon E3-1230v3 CPU (quad-core at 3.30GHz) and 32GB RAM. The cluster runs Hadoop v1.2.1 and consists of a master node and 40 slave nodes. Each slave node can run 3 mappers and 3 reducers (120 mappers and 120 reducers in total) concurrently. The memory size for each mapper and reducer is set to 4GB. Spark v1.5.1 is also installed on the cluster, where each slave is set to use 30GB memory and 3 CPU cores; one core is reserved for system maintenance. We operate GraphLab PowerGraph v2.2 on OpenRTE v1.5.4 on the same cluster servers.

5.2 Experimental Results

In this section, we present experimental results to answer the questions listed at the beginning of Section 5.

5.2.1 Effect of PTE's three methods

Effect of pre-partitioning. We compare the amount of shuffled data between PTE and CTTP to show the effect of pre-partitioning (Section 4.1). Figure 6(a) shows the results on ClueWeb12 with various numbers of edges. It shows that PTE shuffles far less data than CTTP, and the difference gets larger as the data size increases; as the number of edges varies from 0.26 billion to 36 billion, the difference increases from 30x to 245x. The slopes for PTE and CTTP are 0.99 and 1.48, respectively; they reflect the claimed complexities of the shuffled data size, O(|E|) for PTE and O(|E|^{1.5}/√M) for CTTP. Figure 6(b) shows the results on real world graphs; PTE shuffles up to 175x less data than CTTP.
Figure 6: The shuffled data size of PTE and CTTP (a) on ClueWeb12 with various numbers of edges, and (b) on real world graphs. PTE shuffles up to 245x less data than CTTP on ClueWeb12; the gap grows as the data size increases. On real world graphs, PTE shuffles up to 175x less data than CTTP, on ClueWeb09.

Effect of color-direction. To show the effect of color-direction (Section 4.2), we count the number of operations for intersecting two neighbor sets (line 13 in Algorithm 3) in Table 3. PTECD reduces the number of comparisons by up to 6.85x compared to PTEBASE thanks to the color-direction.

Table 3: The number of operations by PTEBASE and PTECD on various graphs. PTECD decreases the number of operations by up to 6.85x compared to PTEBASE.

Dataset | PTEBASE | PTECD | PTEBASE / PTECD
CW12/256 | 1.5 x 10^8 | 2.1 x 10^7 | 6.85
CW12/128 | 6.2 x 10^8 | 1.0 x 10^8 | 6.22
CW12/64 | 2.7 x 10^9 | 4.6 x 10^8 | 5.83
CW12/32 | 1.2 x 10^10 | 2.2 x 10^9 | 5.10
CW12/16 | 4.9 x 10^10 | 1.1 x 10^10 | 4.44
CW12/8 | 2.1 x 10^11 | 5.4 x 10^10 | 3.90
CW12/4 | 8.9 x 10^11 | 2.6 x 10^11 | 3.47
CW12/2 | 3.7 x 10^12 | 1.2 x 10^12 | 3.18
CW12 | 1.6 x 10^13 | 4.9 x 10^12 | 3.14
TWT | 1.1 x 10^12 | 5.4 x 10^11 | 2.03
SD | 8.3 x 10^12 | 4.0 x 10^12 | 2.06
YW | 7.4 x 10^11 | 2.9 x 10^11 | 2.55
CW09 | 2.7 x 10^11 | 6.9 x 10^10 | 3.94

Effect of scheduling computation. We also show the effect of scheduling the calls of the function enumerateTrianglesCD (Section 4.3) by comparing the amount of data read via the network by PTECD and PTESC in Table 4. As expected, for every dataset the ratio PTECD / PTESC is about 1.10 ≈ √(6/5).

Table 4: The amount of data read via the network on various graphs. The ratio PTECD / PTESC is about 1.10 ≈ √(6/5) for all datasets, as expected.

Dataset | PTECD | PTESC | PTECD / PTESC
CW12/256 | 22.5GB | 18.7GB | 1.20
CW12/128 | 58.8GB | 51.5GB | 1.14
CW12/64 | 174GB | 162GB | 1.08
CW12/32 | 485GB | 426GB | 1.14
CW12/16 | 1.3TB | 1.2TB | 1.09
CW12/8 | 3.7TB | 3.3TB | 1.10
CW12/4 | 9.9TB | 9.0TB | 1.10
CW12/2 | 26.9TB | 24.1TB | 1.12
CW12 | 72.9TB | 64.3TB | 1.13
TWT | 102GB | 83GB | 1.23
SD | 206GB | 191GB | 1.08
YW | 1.9TB | 1.7TB | 1.13
CW09 | 2.9TB | 2.6TB | 1.11

Running time comparison. We now compare the running time of the PTEs and the competitors (CTTP, MGT, GraphLab, GraphX) in Figure 7. PTESC shows the best performance, and PTECD follows it very closely. CTTP fails to finish within 5 days when the graph has more than 20 billion edges. GraphLab, GraphX, and MGT also fail when the number of edges is larger than 600 million, 3 billion, and 10 billion, respectively, because of out-of-memory or out-of-range errors. The out-of-range error occurs when a vertex id exceeds the 32-bit integer limit. Note that, even if MGT could handle vertex ids exceeding the integer range, its performance would worsen as the graph size increases, since MGT performs massive I/O (O(|E|^2/M)) when the input data size is large. The slope for the PTEs is 1.26, which is smaller than 1.5; this means that the practical time complexity is smaller than the worst case O(|E|^{3/2}). Figure 1 shows the running time of the various algorithms on the real world datasets. PTESC shows the best performance, outperforming CTTP and MGT by up to 47x and 17x, respectively. Only the proposed algorithms succeed in processing ClueWeb12 with 6.3 billion vertices and 72 billion edges, while all other algorithms fail to process the graph.

Figure 7: The running time on ClueWeb12 with various numbers of edges (o.o.m.: out of memory). PTESC shows the best data scalability; only the PTEs succeed in processing the subgraphs containing more than 20B edges.
The pre-partitioning (CTTP vs. PTEBASE) significantly reduces the running time, while the effect of the scheduling function (PTECD vs. PTESC) is relatively insignificant.

5.2.2 Machine Scalability

We evaluate the machine scalability of the PTEs by measuring the running time of the PTEs and CTTP on YahooWeb, varying the number of machines from 5 to 40, in Figure 8. Note that GraphX and GraphLab are omitted because they fail to process YahooWeb on our cluster. Every PTE shows strong scalability: the slope -0.94 of the PTEs is very close to the ideal value -1. It means that the running time decreases by 2^0.94 = 1.92 times as the number of machines is doubled. On the other hand, CTTP shows weak scalability when the number of machines increases from 20 to 40.

Figure 8: Machine scalability of the PTEs and CTTP on YahooWeb. GraphLab and GraphX are excluded because they failed to process YahooWeb. The PTEs show very strong scalability with exponent -0.94, which is very close to the ideal value -1.

5.2.3 PTE on Spark

We implemented PTESC on Spark as well as on Hadoop to show that the PTEs are general enough to be implemented on any distributed system supporting map and reduce functionalities. We compare the running time of the two implementations in Figure 9. The result indicates that the Spark implementation does not show better performance than the Hadoop implementation, even though the Spark implementation uses distributed memory as well as disks. The reason is that PTE is not an iterative algorithm; thus, there is little data to reuse from memory. Even reading data from distributed memory rather than from remote disks does not improve the performance of PTE; the two take similar time since disk reading and network transmission occur simultaneously.

Figure 9: The running time of PTESC on Hadoop and Spark. There is no significant difference between them.

6. CONCLUSION

In this paper, we propose PTE, a scalable distributed algorithm for enumerating triangles in very large graphs. We carefully design PTE so that it minimizes the amount of shuffled data, total work, and network read. PTE requires O(|E|) shuffled data, O(|E|^{3/2}/√M) network read, and O(|E|^{3/2}) total work, which is worst case optimal. PTE shows the best performance on real world data: it outperforms the state-of-the-art scalable distributed algorithm by up to 47x. Also, PTE is the only algorithm that successfully enumerates more than 3 trillion triangles in the ClueWeb12 graph with 72 billion edges, while all other algorithms, including GraphLab, GraphX, MGT, and CTTP, fail. Future research directions include extending the work to support general subgraphs.

Acknowledgments

This work was supported by the IT R&D program of MSIP/IITP [R0101-15-0176, Development of Core Technology for Human-like Self-taught Learning based on Symbolic Approach]. This work was also funded by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. R0190-15-2012, "High Performance Big Data Analytics Platform Performance Acceleration Technologies Development"). The ICT at Seoul National University provided research facilities for this study.

7. REFERENCES

[1] Jesse Alpert and Nissan Hajaj. http://googleblog.blogspot.kr/2008/07/we-knew-web-was-big.html, 2008.
[2] Shaikh Arifuzzaman, Maleq Khan, and Madhav V. Marathe. PATRIC: a parallel algorithm for counting triangles in massive networks. In CIKM, 2013.
[3] Luca Becchetti, Paolo Boldi, Carlos Castillo, and Aristides Gionis. Efficient algorithms for large-scale local triangle counting. TKDD, 2010.
[4] Jonathan W. Berry, Bruce Hendrickson, Randall A. LaViolette, and Cynthia A. Phillips. Tolerating the community detection resolution limit with edge weighting. Phys. Rev. E, 83(5):056119, 2011.
[5] Bin-Hui Chou and Einoshin Suzuki. Discovering community-oriented roles of nodes in a social network. In DaWaK, pages 52-64, 2010.
[6] Jonathan Cohen. Graph twiddling in a mapreduce world. CiSE, 11(4):29-41, 2009.
[7] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, pages 137-150, 2004.
[8] Jean-Pierre Eckmann and Elisha Moses. Curvature of co-links uncovers hidden thematic layers in the world wide web. PNAS, 99(9):5825-5829, 2002.
[9] Facebook. http://newsroom.fb.com/company-info, 2015.
[10] Ilias Giechaskiel, George Panagopoulos, and Eiko Yoneki. PDTL: parallel and distributed triangle listing for massive graphs. In ICPP, 2015.
[11] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. Powergraph: Distributed graph-parallel computation on natural graphs. In OSDI, pages 17-30, 2012.
[12] Herodotos Herodotou. Hadoop performance models. arXiv, 2011.
[13] Xiaocheng Hu, Yufei Tao, and Chin-Wan Chung. Massive graph triangulation. In SIGMOD, pages 325-336, 2013.
[14] ByungSoo Jeon, Inah Jeon, Lee Sael, and U Kang. Scout: Scalable coupled matrix-tensor factorization - algorithm and discoveries. In ICDE, 2016.
[15] U Kang, Jay-Yoon Lee, Danai Koutra, and Christos Faloutsos. Net-ray: Visualizing and mining billion-scale graphs. In PAKDD, 2014.
[16] U Kang, Brendan Meeder, Evangelos E. Papalexakis, and Christos Faloutsos. Heigen: Spectral analysis for billion-scale graphs. TKDE, pages 350-362, 2014.
[17] U Kang, Hanghang Tong, Jimeng Sun, Ching-Yung Lin, and Christos Faloutsos. Gbase: an efficient analysis platform for large graphs. VLDB J., 21(5):637-650, 2012.
[18] U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. Pegasus: A peta-scale graph mining system - implementation and observations. In ICDM, 2009.
[19] Jinha Kim, Wook-Shin Han, Sangyeon Lee, Kyungyeol Park, and Hwanjo Yu. OPT: A new framework for overlapped and parallel triangulation in large-scale graphs. In SIGMOD, pages 637-648, 2014.
[20] Matthieu Latapy. Main-memory triangle computations for very large (sparse (power-law)) graphs. Theor. Comput. Sci., pages 458-473, 2008.
[21] Rasmus Pagh and Francesco Silvestri. The input/output complexity of triangle enumeration. In PODS, pages 224-233, 2014.
[22] Ha-Myung Park and Chin-Wan Chung. An efficient mapreduce algorithm for counting triangles in a very large graph. In CIKM, pages 539-548, 2013.
[23] Ha-Myung Park, Francesco Silvestri, U Kang, and Rasmus Pagh. Mapreduce triangle enumeration with guarantees. In CIKM, pages 1739-1748, 2014.
[24] Filippo Radicchi, Claudio Castellano, Federico Cecconi, Vittorio Loreto, and Domenico Parisi. Defining and identifying communities in networks. PNAS, 101(9):2658-2663, 2004.
[25] Thomas Schank. Algorithmic aspects of triangle-based network analysis. PhD thesis, University of Karlsruhe, 2007.
[26] Siddharth Suri and Sergei Vassilvitskii. Counting triangles and the curse of the last reducer. In WWW, pages 607-614, 2011.
[27] Twitter. https://about.twitter.com/company, 2015.
[28] Mark N. Wegman and Larry Carter. New hash functions and their use in authentication and set equality. J. Comput. Syst. Sci., 22(3):265-279, 1981.
[29] Zhi Yang, Christo Wilson, Xiao Wang, Tingting Gao, Ben Y. Zhao, and Yafei Dai. Uncovering social network sybils in the wild. TKDD, 2014.