Information Sciences 181 (2011) 3443–3458
journal homepage: www.elsevier.com/locate/ins

Cost based in-network join strategy in tree routing sensor networks

Jun-Ki Min a, Heejung Yang b, Chin-Wan Chung b,⇑

a School of Internet-Media Engineering, Korea University of Technology and Education, Byeongcheon-myeon, Cheonan, Chungnam 330-708, Republic of Korea
b Department of Computer Science, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 305-701, Republic of Korea

Article info
Article history: Received 30 November 2009; Received in revised form 6 October 2010; Accepted 3 April 2011; Available online 19 April 2011
Keywords: Query processing; Sensor network; Join; Cost model

Abstract
Tiny, smart sensors enable applications that access a network of hundreds or thousands of sensors. In many applications, joins are used frequently to find relationships among the readings of different sensors, such as the correlation of sensor readings in distinct regions. In this paper, we present a cost based in-network join strategy called INJECT. Since the optimal join plan depends on conditions such as data distributions and join predicates, blindly using a fixed join plan wastes the energy of sensors. Based on an analysis of how join queries can be handled in sensor networks, we devise several join plans. In particular, since data transmission dominates the energy consumption of a sensor, we devise cost models, each of which reflects the transmission cost of a join plan. Experimental results confirm that INJECT chooses the optimal or a near-optimal plan under various conditions.
© 2011 Elsevier Inc. All rights reserved.

1. Introduction

Wireless sensor networks (WSNs) are systems that are typically composed of a large number of sensors and a base station where a user can access data.
Sensor nodes in sensor networks are severely constrained in terms of battery power. Replacing the battery of a sensor is either too expensive or impossible. Energy preservation is therefore a major research issue, since it directly impacts the lifetime of a network. Recent research has shown that radio communication dominates the energy consumption of a sensor. Thus, many techniques in diverse areas such as routing protocols [9,12,16], event detection [1,22], in-network aggregation [13], and approximate data gathering [5,8,11,15] have been proposed in order to reduce the communication overhead.

In-network aggregation provides a great opportunity for reducing the communication overhead using summary data (e.g., SUM) and/or exemplary data (e.g., MIN and MAX). However, a single aggregated value is insufficient to analyze the whole sensor field in some applications [5]. Thus, some data gathering techniques [5,8,11,15] for sensor networks have been proposed. Periodic reporting of sensor readings drains the energy of sensors since it results in excessive communication. So, to reduce the communication overhead, in-network approximation techniques have been proposed. In this approach, a data model [11] or a data compression technique [15] is applied.

In some applications, a user wants to identify the relationship between sensor readings in different regions. For example, a climatologist wants to analyze the correlation between the rainfall of one region and the temperature of another region. This regional correlation can be expressed as a join query over the sensor readings in the two regions.

⇑ Corresponding author. Tel.: +82 42 869 3537; fax: +82 42 869 3577. doi:10.1016/j.ins.2011.04.017

A naive plan to answer a join query is to gather the sensor readings of the two regions at the base station and to perform the join operation there. This approach may waste much sensor energy, since sensor readings that will not participate in the join result still have to be transmitted to the base station. An alternative plan is to perform the join inside the sensor network in order to filter out unrelated sensor readings as early as possible. However, blindly using a fixed in-network join plan may waste as much sensor energy as the naive plan does. Thus, in this paper, we propose INJECT, an In-Network Join stratEgy using Cost based optimization in Tree routing sensor networks.

Tree based routing has been proposed as an energy efficient mechanism to transmit sensor readings. Due to its simplicity and manageability, many works on sensor networks are based on tree routing, where a message from a node is passed to another node through a routing tree. Our work is also based on tree routing.

Join processing has a long and rich history in the database field. However, due to the hierarchical structure of sensor networks, traditional techniques are not directly applicable. Thus, recently, research on in-network join processing [17,20–23] has been proposed to reduce the communication overhead. In in-network join processing, a node inside the network generates join results. However, some techniques just propose join plans without proper cost models. Also, although other work suggests cost models for in-network joins, the proposed cost models are too rough to be used in query optimization. In addition, some in-network join plans are based on location based routing protocols such as GPSR. Thus, these plans cannot be applied to tree-based routing environments. To the best of our knowledge, our work is the first on cost-based optimization in tree routing sensor networks.
Based on our cost model, accurate cost models for diverse in-network join techniques, including future techniques, can be devised.

Our contribution. Our work focuses on efficient in-network join processing with a low transmission cost in order to conserve the energy of sensors. Let the sets of sensor readings in a region QL and a region QR be L and R, respectively, and let the join predicate of L and R be PJOIN(L, R). Then a join query is defined as L ⋈_{PJOIN(L,R)} R. In the result of the join, an element in L may be associated with many elements in R, and vice versa.

INJECT considers semijoin operators as in-network join operators. Since a semijoin generates exactly the tuples that will participate in the join result, unrelated tuples can be filtered out. Thus, the network communication overhead is reduced. In INJECT, we assume that the actual join operation is performed at the base station, which has unlimited power, using the results of the semijoins (i.e., the results of L ⋉ R and R ⋉ L).

INJECT makes the following contributions toward performing an in-network join in an energy efficient manner.

We propose an in-network join framework using cost based optimization to identify an efficient join plan on tree based routing networks. For INJECT, we devise diverse join plans for sensor network environments. First, we suggest three basic join plans: baseJoin, coverJoin, and sideJoin. Then, by analyzing the hierarchical structure of tree routing networks, we devise partitionJoin. In addition, we devise synopsisJoin by combining the synopsis technique with partitionJoin.

We devise accurate cost models for diverse join plans in tree routing. For cost based optimization, an accurate cost model is an indispensable component. In our work, we build a basic cost model for gathering the sensor readings in a query region and sending the gathered readings to a node in a tree based routing environment. Based on the basic cost model, we develop the cost models for the devised join plans.
We provide an extensive experimental study of our framework in diverse environments. Our experimental results show that INJECT provides accurate cost models and that, as a result, the most efficient query plan is selected.

Organization of the paper. In the remainder of the paper, we present the details of INJECT. In Section 2, we present the characteristics of our join operation. Section 3 describes the three basic join plans and their cost models. In Section 4, we present the enhanced join plans that we devise. Section 5 contains the experimental results. Section 6 outlines related work. Finally, in Section 7, we summarize our work and suggest some future studies.

2. Preliminaries

We first describe the assumptions used in the paper and formalize the in-network join processing problem. Let us consider a set of sensor nodes S = {s1, s2, ..., sn} located at positions {(x1, y1), ..., (xn, yn)}, respectively. Sensor readings are collected at the base station using a tree routing protocol. Each sensor keeps its hop distance, which is the number of hops from the base station to it. Two nodes capable of bi-directional wireless communication are referred to as neighbors. Also, the base station knows the locations of the sensors and the tree routing hierarchy among them.

In INJECT, a join query is submitted to the base station. The base station generates the diverse join plans presented in Sections 3 and 4. Then, the base station identifies the optimal join plan among them using their cost models.

We analyze the join processing problem in sensor networks for a join query between relations. For simplicity of presentation, we assume two relations. In practice, two-way joins are the most frequent, and they are also the basis of m-way joins. A relation can be a set of sensor readings in a region, as shown in Fig. 1. There are many motivating examples of in-network joins in the related literature. Here, we present an SQL form for a join.

Fig. 1. Routing tree and join scenario.

    SELECT L.*, R.*
    FROM Sensor L, Sensor R
    WHERE L.location in QL AND R.location in QR
      AND PL(L) AND PR(R) AND PJOIN(L, R)
    {SAMPLE INTERVAL X} {FOR D}

In the above SQL statement, relations L and R are the sets of sensor readings restricted by regions QL and QR, respectively. Also, PL and PR are selection predicates for the relations L and R, respectively. If there is no selection predicate for a relation, all tuples in the relation (i.e., all sensor readings in QL (or QR)) are sent to the base station. Additionally, PJOIN(L, R) is a join predicate for the relations L and R. Since sensor readings are generated continuously, the above SQL statement is executed continuously. For this, we use the TinyDB syntax [14]: SAMPLE INTERVAL and FOR. The query is executed once per X seconds for a period of D seconds. The SAMPLE INTERVAL and FOR terms are optional. The default X for SAMPLE INTERVAL is 1 s and the default D for FOR is 1.

The problem that we intend to solve is formalized as follows:

Problem definition. Given a sensor network consisting of a set of sensor nodes S = {s1, s2, ..., sn}, tree routing is used to disseminate a query and collect sensor readings. Let the sets of sensor readings in a region QL and a region QR be L and R, respectively, and let the join predicate of L and R be PJOIN(L, R). Find the most effective plan to process the join query defined as L ⋈_{PJOIN(L,R)} R.

In order to estimate a query cost, the selectivities of the selection predicates PL and PR as well as the join selectivity of the join predicate PJOIN(L, R) are required. To obtain these statistics, the base station executes an in-network join plan, baseJoin (described in Section 3), and collects the required statistics from the join results in learning phases.
Also, the query optimizer periodically recomputes the costs of the query plans using the history of previous query results, and the base station broadcasts a more efficient plan if one exists. We omit the details since our work focuses on the diverse join plans and their accurate cost models.

3. In-network join processing

In this section, we present the three basic join plans considered in INJECT and suggest their cost models based on tree routing. Based on the basic join plans, some extended join plans will be presented in Section 4. The proposed join plans are summarized in Table 1.

Table 1. Summary of join plans.
Join plan           Intuition
baseJoin            A join is performed at the base station
coverJoin           Semijoins are performed at the node that can obtain all sensor readings of two regions
sideJoin            Sensor readings of a region are sent to a node in the other side to perform a semijoin
partitionJoin       Sensor readings of a region are distributed into nodes in the other side to perform semijoins
synopsisJoin        Similar to partitionJoin except sending a synopsis instead of sensor readings
fullsynopsisJoin    synopsisJoin is applied to both query regions

3.1. Basic join plans

In order to explain the basic join plans and their cost models, we define some basic concepts.

Definition 1. The sensing region of a sensor s, SR(s), is the minimum bounding region that covers the locations of s and its descendants in the routing tree. The descendants of a sensor s are all the nodes in its subtree.

Definition 2. The covering node set of a region Q, cov(Q), is the set of nodes each of whose SR() contains the given region Q.

Fig. 2. Sensing region SR().

Definition 3. The minimal cover node of a region Q, covmin(Q), is a node whose SR() covers the given region and is minimal.

By Definitions 2 and 3, the node covmin(Q) for a region Q is an element of cov(Q).
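Definitions 1 to 3 can be illustrated with a small sketch. It assumes rectangular regions given as (xmin, ymin, xmax, ymax) tuples and interprets "minimal" in Definition 3 as smallest bounding-box area; the `Node` class, the helper names, and the node coordinates are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    x: float
    y: float
    children: list = field(default_factory=list)


def sensing_region(node):
    """Definition 1: bounding box (xmin, ymin, xmax, ymax) of a node and its descendants."""
    xmin = xmax = node.x
    ymin = ymax = node.y
    for child in node.children:
        cx0, cy0, cx1, cy1 = sensing_region(child)
        xmin, ymin = min(xmin, cx0), min(ymin, cy0)
        xmax, ymax = max(xmax, cx1), max(ymax, cy1)
    return (xmin, ymin, xmax, ymax)


def contains(outer, inner):
    """True if rectangle `outer` fully contains rectangle `inner`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def cov(root, q):
    """Definition 2: all nodes whose sensing region contains q."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        if contains(sensing_region(node), q):
            found.append(node)
            # A child's SR lies inside its parent's SR, so only descend from members.
            stack.extend(node.children)
    return found


def cov_min(root, q):
    """Definition 3 (assumed reading): the member of cov(q) with the smallest SR."""
    members = cov(root, q)
    return min(members, key=lambda n: area(sensing_region(n))) if members else None


# Hypothetical layout: s1 is the root, s2 and s3 are its children.
s8 = Node("s8", 2, 1)
s4 = Node("s4", 1, 4, [s8])
s5 = Node("s5", 4, 4)
s6 = Node("s6", 7, 4)
s7 = Node("s7", 9, 4)
s2 = Node("s2", 3, 7, [s4, s5])
s3 = Node("s3", 8, 7, [s6, s7])
s1 = Node("s1", 5, 10, [s2, s3])
Q = (1.5, 2.0, 3.5, 5.0)

print(sorted(n.name for n in cov(s1, Q)))  # ['s1', 's2']
print(cov_min(s1, Q).name)                 # s2
```

The sketch recomputes sensing regions on every call for brevity; a base station that knows the routing tree could compute them once and cache them.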
Also, since our data transmission is based on tree routing, covmin(Q) has the maximum hop distance from the base station among the elements of cov(Q).

For example, as shown in Fig. 2, for a given query region Q drawn as a solid box, the sensing region of s2 (= SR(s2)) covers Q. Also, SR(s1) covers Q. Thus, cov(Q) = {s1, s2}. Note that s1 and s2 can collect all readings of the sensors in Q. Since SR(s2) is smaller than SR(s1), covmin(Q) = s2. We use a rectangle to illustrate a region for ease of understanding; however, any polygon can be used.

The three basic join plans are depicted in Fig. 3. Roughly speaking, join plans are classified by the join node where a semijoin operation is performed. As mentioned earlier, a naive way to answer a join query is to gather all sensor readings at the base station and to perform the join there. We call this plan baseJoin.

Another join plan is for an intermediate node in the tree to perform the semijoins. Such a node must be able to collect the sensor readings obtained from the sensors in both QL and QR. Thus, the intermediate node (i.e., the join node) is an ancestor of all sensors in QL and QR, and the join node is covmin(QL ∪ QR). We call this plan coverJoin.

In coverJoin, we do not consider the other nodes in cov(QL ∪ QR) as join nodes. Suppose that a semijoin operation at a node si ∈ cov(QL ∪ QR), which is not covmin(QL ∪ QR), is an efficient execution plan. This means that the size of the semijoin result is smaller than that of all the sensor readings. By Definition 3, the semijoin operation can also be performed at covmin(QL ∪ QR), which lies below si in the tree. Thus, it is more efficient to send the semijoin results from covmin(QL ∪ QR) than from si.

The final basic join plan is sideJoin. In sideJoin, as shown in Fig. 3(c), the join column of the sensor readings in QL is sent to the other side, where a semijoin operation is performed. Thus, a semijoin operation is performed at a node which is an ancestor node of sensors in QR.
Therefore, a semijoin is performed at covmin(QR).

Fig. 3. Basic join plans.

Among the above three join plans, coverJoin and sideJoin are categorized as in-network join approaches since semijoins are performed at nodes in the network. In general, if the join selectivity is very high, the in-network join approach is useless. In contrast, if the join selectivity is very low, an in-network join approach is beneficial since the size of the join result is much smaller than that of the sensor readings participating in the join operation. Therefore, the optimal join plan should be selected by using the cost model.

In our work, we only consider semijoin operators for in-network join processing. Note that, since the base station receives all the tuples (= L ⋉ R and R ⋉ L) participating in the join result, the join result can be computed at the base station without loss of information.

Consequently, in coverJoin, two semijoin operations (i.e., L ⋉ R and R ⋉ L) are performed at covmin(QL ∪ QR) and the two semijoin results are sent to the base station.

In sideJoin, R ⋉ L is performed at covmin(QR). To do this, the sensor readings in QL are sent to covmin(QL ∪ QR). Then the join column, not the whole sensor readings, is sent from covmin(QL ∪ QR) to covmin(QR). Therefore, R ⋉ L is performed at covmin(QR), and then L ⋉ (R ⋉ L) is performed at covmin(QL ∪ QR) using the semijoin result (= R ⋉ L) obtained from covmin(QR). Since similar arguments hold when the join column of R is sent to covmin(QL), we only present the case in which the join column of relation L is sent to covmin(QR) in this paper.

3.2. Cost model

In this section, we present the cost models for INJECT. To estimate costs, various statistical estimations can be included. However, maintaining the required statistics incrementally incurs additional cost. Thus, in our cost model, we use simple but reasonable statistical estimation.
However, techniques for statistical estimation can be applied orthogonally.

As mentioned in Section 2, the base station identifies the optimal join plan in INJECT. Disseminating the optimal plan consumes energy. Since a join plan must be propagated to all participating nodes in the query regions, the dissemination costs of the join plans are similar. In addition, the plan dissemination cost is quite small compared to the join processing cost, since the join processing is performed for a period D. Therefore, we only consider the join processing cost.

Since the transmission sizes (i.e., the sizes of the data to be joined) can be obtained using our model, the computation cost can be derived. However, it is well known that the computing cost is negligible compared to the transmission cost in sensor network environments. In the case of the Berkeley sensor motes, transmitting a single bit is equivalent to 800 instructions in terms of power consumption [13]. Thus, like the related literature, we omit the computation cost.

Furthermore, it is very hard to reflect every aspect in a cost model, so an abstraction is required. For instance, query optimization in traditional databases mainly considers the disk I/O time, although computing time exists. In addition, the computing cost is mainly determined by the size of the data to be joined. Thus, the total computing costs of the in-network join plans are similar. Therefore, considering only the communication overhead is sufficient to choose an effective join plan.

In addition, for simplicity, we do not consider link failures. A link failure can be handled by retransmission. Thus, using the retransmission probability, our cost model can be extended in a straightforward manner. These assumptions allow us to build a concise cost model.

As shown in Fig.
4, to transmit the sensor readings in the region Q, the sensor readings are gathered at the node s (= covmin(Q)) through the tree based routing, and then the gathered sensor readings are sent to the node d. In addition, as mentioned earlier, the selection operation PQ is applied to the sensor readings of the region Q. Let the selectivity of PQ be sel(PQ), which is the proportion of sensor readings that satisfy the selection operation. Note that sel(PQ) of a node which is not in the region Q is zero. Let the size of a sensor reading be r. Then, when the node s transmits a message, the transmission cost to a neighbor, T(s) (i.e., the average size of a message), can be expressed by Eq. (1).

Fig. 4. Data transmission with a query region.

Fig. 5. Hop distances.

$$T(s) = r \cdot \mathit{sel}(P_Q) + \sum_{c \in \mathit{child}(s)} T(c) \qquad (1)$$

Using the transmission cost T(), we can compute the cost of sending all readings in region Q to the node d. As mentioned earlier, the node s (= covmin(Q)) gathers the sensor readings in Q and transmits the gathered readings to d. Thus, the cost is computed as follows:

$$\mathit{Cost}(Q, d) = \mathit{Cost}_{gathering}(s) + \mathit{Cost}_{sending}(s, d) \qquad (2)$$

To gather the sensor readings at the node s, the children of s gather sensor readings from their descendants and send the gathered readings to s. Thus, Cost_gathering(s) is expressed recursively as follows:

$$\mathit{Cost}_{gathering}(s) = \sum_{c \in \mathit{child}(s)} \left( \mathit{Cost}_{gathering}(c) + T(c) \right)$$

When the node s transmits data toward d through its ancestors, the transmission cost T() of each of s's ancestors on the path is equal to that of s. Thus, the transmission cost from s to d is expressed as follows, where hopDiff(s, d) denotes the difference of the hop distances of s and d:

$$\mathit{Cost}_{sending}(s, d) = \mathit{hopDiff}(s, d) \cdot T(s)$$

Based on Eq. (2), we can derive the cost models of the three basic join plans. Suppose that, as shown in Fig.
5, the difference of hop distances from covmin(QL) to covmin(QL ∪ QR), that from covmin(QR) to covmin(QL ∪ QR), and that from covmin(QL ∪ QR) to the base station are DL, DR and DC, respectively.

In baseJoin, the sensor readings in QL and QR are gathered at covmin(QL) and covmin(QR), respectively. The two gathered data sets move to covmin(QL ∪ QR). The node covmin(QL ∪ QR) sends the data sets received from the two nodes to the base station. Thus, the cost of baseJoin, Cost_base(), can be derived as follows:

$$\begin{aligned}
\mathit{Cost}_{base}(Q_L, Q_R) ={} & \mathit{Cost}_{gathering}(cov_{min}(Q_L)) + \mathit{Cost}_{gathering}(cov_{min}(Q_R)) \\
& + D_L \cdot T(cov_{min}(Q_L)) + D_R \cdot T(cov_{min}(Q_R)) \\
& + D_C \cdot \left( T(cov_{min}(Q_L)) + T(cov_{min}(Q_R)) \right)
\end{aligned} \qquad (3)$$

In coverJoin, as in baseJoin, the sensors in QL and QR send data to covmin(QL) and covmin(QR), and the gathered data is sent to covmin(QL ∪ QR). However, in contrast to the baseJoin plan, as shown in Fig. 3(b), semijoin operations are performed at covmin(QL ∪ QR).

In order to compute the transmission cost from covmin(QL ∪ QR) to the base station, we compute the size of L ⋉ R and the size of R ⋉ L. The selectivity of a semijoin of L by R gives the fraction of tuples of L which join with tuples of R. Accurate estimation of the join selectivity is important since the choice of an effective join plan depends on it. It is difficult to estimate the join selectivity accurately. However, an approximation for the semijoin selectivity was suggested in [10] as follows:

$$\mathit{SJS}(R \ltimes_A L) = \frac{\mathit{size}(\Pi_A(L))}{\mathit{size}(\mathit{dom}(A))}, \quad \text{where } \mathit{dom}(A) \text{ is the domain of attribute } A. \qquad (4)$$

In the above equation, the semijoin selectivity SJS(R ⋉A L) is only affected by the size of the join column A of L. Thus, when the size of the gathered sensor readings in QR at a certain time is T(covmin(QR)), the size of the semijoin result is T(covmin(QR)) · SJS(R ⋉A L). We use SJSL.A to denote SJS(R ⋉A L) concisely.
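As a rough illustration of how Eqs. (1) to (4) fit together, the following sketch evaluates them on a toy routing tree. The `Node` class, the function names, and all numeric values are hypothetical; the sketch also assumes a single selectivity per region and treats size(ΠA(L)) as the number of distinct join values.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    in_region: bool = True          # sel(P_Q) is zero for nodes outside Q
    children: list = field(default_factory=list)


def T(s, r, sel):
    """Eq. (1): average message size sent by node s to a neighbor."""
    own = r * sel if s.in_region else 0.0
    return own + sum(T(c, r, sel) for c in s.children)


def cost_gathering(s, r, sel):
    """Recursive gathering cost: every child gathers, then sends T(c) to s."""
    return sum(cost_gathering(c, r, sel) + T(c, r, sel) for c in s.children)


def cost_sending(s, hop_diff, r, sel):
    """Cost of forwarding the gathered data of s over hop_diff hops."""
    return hop_diff * T(s, r, sel)


def cost_base(cov_l, cov_r, d_l, d_r, d_c, r, sel_l, sel_r):
    """Eq. (3): both gathered data sets travel unfiltered to the base station."""
    t_l, t_r = T(cov_l, r, sel_l), T(cov_r, r, sel_r)
    return (cost_gathering(cov_l, r, sel_l) + cost_gathering(cov_r, r, sel_r)
            + d_l * t_l + d_r * t_r + d_c * (t_l + t_r))


def semijoin_selectivity(l_join_values, domain_size):
    """Eq. (4), assuming size(Pi_A(L)) counts distinct join values."""
    return len(set(l_join_values)) / domain_size


# Hypothetical example: each cover node has two leaf children, all inside the region.
cov_l = Node("covmin(QL)", children=[Node("l1"), Node("l2")])
cov_r = Node("covmin(QR)", children=[Node("r1"), Node("r2")])

print(T(cov_l, r=4, sel=0.5))               # 6.0
print(cost_gathering(cov_l, r=4, sel=0.5))  # 4.0
print(cost_base(cov_l, cov_r, d_l=1, d_r=2, d_c=3, r=4, sel_l=0.5, sel_r=0.5))  # 62.0
print(semijoin_selectivity([1, 2, 2, 3], domain_size=10))  # 0.3
```

In this toy setting each cover node produces a message of size 6, so the base station term D_C · (6 + 6) dominates the total, which is the term the semijoin-based plans aim to shrink.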
Using the semijoin selectivities SJSR.A and SJSL.A, where the join attribute of R and L is A, we can obtain the size of the data generated at covmin(QL ∪ QR). The cost of coverJoin, Cost_cover(), is estimated as follows:

$$\begin{aligned}
\mathit{Cost}_{cover}(Q_L, Q_R) ={} & \mathit{Cost}_{gathering}(cov_{min}(Q_L)) + \mathit{Cost}_{gathering}(cov_{min}(Q_R)) \\
& + D_L \cdot T(cov_{min}(Q_L)) + D_R \cdot T(cov_{min}(Q_R)) \\
& + D_C \cdot \left( T(cov_{min}(Q_L)) \cdot \mathit{SJS}_{R.A} + T(cov_{min}(Q_R)) \cdot \mathit{SJS}_{L.A} \right)
\end{aligned} \qquad (5)$$

In the sideJoin plan, as in the other join plans, all sensor readings are gathered at covmin(QL) and covmin(QR). But, unlike the other join plans, the join column A of L is sent to covmin(QR) in order to perform R ⋉ L.

Since the gathered data is sent from covmin(QL) to covmin(QL ∪ QR), the transmission cost DL · T(covmin(QL)) is required. Then, the projection of the gathered sensor readings on the join attribute is sent to covmin(QR). In order to keep the cost model concise, following the general convention, we simply assume that the cardinality of a projected result is equal to the cardinality of the original relation. Therefore, the size of the transmitted data is reduced to T(covmin(QL)) · j/r, where j is the size of a join column and r is the size of a sensor reading. So, the transmission cost of the join column of L to covmin(QR) (= DR · T(covmin(QL)) · j/r) is required in sideJoin.

Then, as shown in Fig. 3(c), the semijoin R ⋉ L is performed at covmin(QR) and the semijoin result is sent back to covmin(QL ∪ QR). As in the cost of coverJoin (Eq. (5)), we can obtain the size of the semijoin result using the join selectivity. The size of the semijoin R ⋉ L is T(covmin(QR)) · SJSL.A. Thus, the transmission cost of the semijoin result from covmin(QR) to covmin(QL ∪ QR) is DR · T(covmin(QR)) · SJSL.A.

Finally, at the node covmin(QL ∪ QR), the semijoin L ⋉ R (= L ⋉ (R ⋉ L)) is performed and the two semijoin results are sent to the base station.
This transmission cost is DC · (T(covmin(QL)) · SJSR.A + T(covmin(QR)) · SJSL.A). Therefore, the cost of sideJoin, Cost_side(), is derived as follows:

$$\begin{aligned}
\mathit{Cost}_{side}(Q_L, Q_R) ={} & \mathit{Cost}_{gathering}(cov_{min}(Q_L)) + \mathit{Cost}_{gathering}(cov_{min}(Q_R)) \\
& + D_L \cdot T(cov_{min}(Q_L)) + D_R \cdot T(cov_{min}(Q_L)) \cdot \frac{j}{r} + D_R \cdot T(cov_{min}(Q_R)) \cdot \mathit{SJS}_{L.A} \\
& + D_C \cdot \left( T(cov_{min}(Q_L)) \cdot \mathit{SJS}_{R.A} + T(cov_{min}(Q_R)) \cdot \mathit{SJS}_{L.A} \right)
\end{aligned} \qquad (6)$$

4. Enhanced join plans

In this section, we present other join plans, called partitionJoin and synopsisJoin, which exploit hierarchical structures (i.e., tree routing). Even though semijoin and synopsis join approaches are well known in distributed databases (i.e., a flat structure), our work is more general than the previous join approaches for distributed databases. In addition, to choose optimal join locations for partitionJoin and synopsisJoin, we devise a recursive expression for dynamic programming and its greedy version.

4.1. Partition join

In the coverJoin and sideJoin plans, semijoins are performed at one or two nodes. In the partitionJoin plan, however, semijoin operations are performed at several nodes.

The basic intuition of partitionJoin is that ∪c (Rc ⋉ L) = R ⋉ L, where ∪c Rc = R. For general join operators, the intuition also holds. As shown in Fig. 6, a query region QR can be partitioned into several subregions. Each subregion can be obtained using the following definition.

Definition 4. A subregion of a query region QR is an element of a set {QR1, QR2, ..., QRn} such that, for each c ∈ child(covmin(QR)), QRc = SR(c) ∩ QR, where QRc ≠ ∅.

An example of partitionJoin is described in Fig. 7. Since the node covmin(QR) gathers sensor readings from QRc, the semijoin Rc ⋉ L can be performed at c or at covmin(QR). Thus, the problem of partitionJoin is to choose the set of child nodes which perform semijoin operations. This problem applies recursively to the descendants of covmin(QR), since partitionJoin can be applied to a subregion QRc. For example, as shown in Fig.
7, the nodes s1 and s2 receive ΠA(L) and return R1 ⋉ L and R2 ⋉ L, respectively. But s3 just sends R3. Thus, R3 ⋉ L is performed at covmin(QR). Note that partitionJoin can be applied recursively to the children of s1 and s2.

Fig. 6. Partitions of QR.

Fig. 7. An example of partitionJoin.

Let Cost_partition(QL, QR) be the cost of the partitionJoin plan. As in sideJoin, the join column of L is sent to covmin(QR) and the semijoin result of R ⋉ L is transmitted to covmin(QL ∪ QR) in partitionJoin. However, in contrast to sideJoin, the join column of L moves down to the child nodes of covmin(QR). Some child nodes whose SR()s cover QRi generate the result of Ri ⋉ L, while the other child nodes simply return their gathered readings. Thus, the cost Cost_partition(QL, QR) is derived as follows:

$$\begin{aligned}
\mathit{Cost}_{partition}(Q_L, Q_R) ={} & \mathit{Cost}_{gathering}(cov_{min}(Q_L)) + \mathit{Cost}_{partial}(cov_{min}(Q_L), cov_{min}(Q_R)) \\
& + D_L \cdot T(cov_{min}(Q_L)) + D_R \cdot T(cov_{min}(Q_L)) \cdot \frac{j}{r} + D_R \cdot T(cov_{min}(Q_R)) \cdot \mathit{SJS}_{L.A} \\
& + D_C \cdot \left( T(cov_{min}(Q_L)) \cdot \mathit{SJS}_{R.A} + T(cov_{min}(Q_R)) \cdot \mathit{SJS}_{L.A} \right)
\end{aligned} \qquad (7)$$

The term Cost_partial() is used in Cost_partition(QL, QR) in place of Cost_gathering(covmin(QR)) in Cost_side(QL, QR). When a node sends a message to another node, the receiving node consumes energy to receive the message. The receiving cost is generally proportional to the sending cost (see footnote 1). Here, we use Ts() and Tr() in Cost_partial() instead of T() since, by broadcasting from a parent, several children receive a message in WSN environments.
The term Cost_partial(covmin(QL), covmin(QR)) is as follows:

$$\begin{aligned}
\mathit{Cost}_{partial}(cov_{min}(Q_L), cov_{min}(Q_R)) = \min_{C \subseteq \mathit{qchild}(cov_{min}(Q_R))} \Big\{ & T_s(cov_{min}(Q_L)) \cdot \frac{j}{r} + \sum_{c \in C} T_r(cov_{min}(Q_L)) \cdot \frac{j}{r} + \sum_{c \in C} T(c) \cdot \mathit{SJS}_{L.A} \\
& + \sum_{c \in C} \mathit{Cost}_{partial}(cov_{min}(Q_L), c) \\
& + \sum_{c' \in \mathit{qchild}(cov_{min}(Q_R)) - C} \left( \mathit{Cost}_{gathering}(c') + T(c') \right) \Big\}
\end{aligned} \qquad (8)$$

where qchild(covmin(QR)) is the set of children whose SR() overlap with QR, and Ts(covmin(QL)) = 0 if C = ∅.

In Cost_partial(), a subset C of qchild(covmin(QR)) is the set of nodes which generate the results of semijoins. To do this, ΠA(L) is sent to C. Its cost is Ts(covmin(QL)) · j/r. Then, all nodes in C receive ΠA(L). The cost is Σc∈C Tr(covmin(QL)) · j/r. The transmission cost of Rc ⋉ L from c ∈ C is T(c) · SJSL.A. Since partitionJoin is applied recursively to c, Cost_partial(covmin(QL), c) represents this recursion. Also, since a child c′ which does not perform a semijoin simply gathers the sensor readings of its descendants and transmits them to its parent, the term (Cost_gathering(c′) + T(c′)) is used. If C is empty, Ts(covmin(QL)) = 0 since ΠA(L) does not need to be transmitted to the children. Note that, if it is beneficial for covmin(QR) not to transmit ΠA(L) to its child nodes, Cost_partition() is equal to Cost_side(). Therefore, sideJoin is a special case of partitionJoin.

In the optimal plan obtained from the recursive expression in Eq. (8), each child also has an optimal plan. In other words, the principle of optimality holds. Thus, dynamic programming can be applied to find the optimal join plan. The time complexity of dynamic programming is O(2^n), where n is the number of descendant nodes of covmin(QR), since all subsets of descendants are evaluated. In order to reduce the time complexity, we devise a greedy method. Fig. 8 shows a greedy algorithm for partitionJoin. This algorithm traverses the subtree of covmin(QR) in a breadth-first manner.
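To make the recursion in Eq. (8) concrete, the sketch below evaluates it by brute force over all subsets C of the query children. It is an illustration under assumptions, not the paper's dynamic programming or greedy procedure: the `Node` class, the precomputed `t` and `gathering` fields, and the numeric values are hypothetical, and `ts` and `tr` stand for Ts(covmin(QL)) and Tr(covmin(QL)).

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Node:
    name: str
    t: float                  # T(node), assumed precomputed from Eq. (1)
    gathering: float = 0.0    # Cost_gathering(node), assumed precomputed
    qchildren: list = field(default_factory=list)  # children whose SR() overlaps Q_R


def cost_partial(node, ts, tr, j_over_r, sjs):
    """Exhaustive evaluation of Eq. (8); returns (cost, names of the chosen set C)."""
    kids = node.qchildren
    best_cost, best_names = None, []
    for k in range(len(kids) + 1):
        for chosen in combinations(range(len(kids)), k):
            cost = 0.0
            if chosen:
                cost += ts * j_over_r            # parent broadcasts the join column once
            for i, child in enumerate(kids):
                if i in chosen:
                    sub_cost, _ = cost_partial(child, ts, tr, j_over_r, sjs)
                    # child receives the join column, returns R_c semijoin L,
                    # and may push the semijoin further down its own subtree
                    cost += tr * j_over_r + child.t * sjs + sub_cost
                else:
                    # child only gathers and forwards its raw readings
                    cost += child.gathering + child.t
            if best_cost is None or cost < best_cost:
                best_cost, best_names = cost, [kids[i].name for i in chosen]
    return best_cost, best_names


# Hypothetical example: child a forwards a lot of data, child b very little.
a = Node("a", t=8.0)
b = Node("b", t=0.5)
root = Node("covmin(QR)", t=8.5, qchildren=[a, b])

cost, chosen = cost_partial(root, ts=2.0, tr=1.0, j_over_r=0.5, sjs=0.25)
print(cost, chosen)  # 4.0 ['a']
```

In this example the semijoin pays off only at child a: filtering a's 8 units of data saves more than the cost of delivering the join column, while for b the receiving cost exceeds the saving. The enumeration is exponential in the number of query children, which is the motivation the text gives for a greedy alternative.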
The procedure compute_partial() performs the following:

Candidate children which may generate the semijoin results are computed (Lines 7–14). If a child node c ∈ qchild() generates the semijoin result, c receives ΠA(L) and returns the result of Rc ⋉ L. Thus, if the transmission cost T(c) is greater than the sum of the receiving cost of ΠA(L) and the transmission cost of the semijoin result (Line 8), it may be beneficial to perform the semijoin at c. If c is a candidate, its cost is added to costC (Line 10).

Children for generating the semijoin results are computed (Lines 15–21). In order to perform semijoins at a subset C, a parent node (i.e., aNode) sends ΠA(L). Thus, when the benefit (= Σc∈C T(c) − costC) is greater than Ts(covmin(QL)) · j/r, performing semijoins at C is efficient (Line 15). If it is beneficial to perform the semijoins at C, we can further consider the children of C (Line 16).

The time complexity of our algorithm is O(n) since the algorithm traverses the child nodes in a breadth-first manner.

Cost_partial(covmin(QL), covmin(QR)) based on the heuristic is computed as Costpartial (Lines 3–22). Costpartial is initialized to 0 (Line 3). If it is beneficial to perform semijoins at C (Line 15), the sending cost from the parent to C (= Ts(covmin(QL)) · j/r) and costC are added to Costpartial (Line 17). If it is not beneficial to perform a semijoin at a child c, Cost_gathering(c) and T(c) are added to Costpartial (Lines 12, 20).

Footnote 1: In [18], the authors state that the receiving cost is generally 60% less than the sending cost.

Fig. 8. A greedy algorithm for partition join.

Thus, the query optimizer can choose the better plan between partitionJoin and sideJoin by comparing Costpartial and Cost_gathering(covmin(QR)).

4.2. Synopsis join

In partitionJoin presented in Section 4.1, the transmission cost for ΠA(L) is a key factor in deciding whether to apply partitionJoin.
Thus, if we can reduce this cost, a more efficient join plan can be obtained. A synopsis is a summary of a relation. By using a synopsis, we may reduce this cost. We call this method synopsisJoin.

In our work, different synopses are used depending on the join condition. As mentioned in [22], when a join is not an equi-join, sending the min (or max) value is sufficient to perform a join. For example, if the join condition is L.A < R.A, only the minimum value of L.A is required to perform R ⋉ L, since all tuples in R with R.A values greater than the minimum value of L.A participate in the join operation.

For an equi-join, we use the bloom filter [3]. The bloom filter consists of an array of m bits and a set of k independent hash functions, each of which maps an element to an integer in the range [1, m]. An element of a set is represented in the bloom filter by setting all positions of the bit array computed by the hash functions to 1. We can check membership using the bloom filter. Suppose that the bloom filter is constructed from the set of attribute values of L.A. If at least one of the array positions related to an attribute value is 0, the attribute value is not a member. Thus, by using the membership checking feature of the bloom filter, an equi-join is performed without the original relation. The bloom filter can produce some false positives; however, they are not significant.

synopsisJoin is divided into two specific plans: synopsisJoin and a variant called fullsynopsisJoin.

The synopsisJoin plan is similar to partitionJoin except that a synopsis of PA(L) is sent from covmin(QL ∪ QR) to covmin(QR). In the previous plans, the size of the data to be transmitted is computed. But, since a fixed-size synopsis is transmitted, a different model is required.

In synopsisJoin, covmin(QL ∪ QR) may not send a synopsis of PA(L) if no tuples of L arrive from covmin(QL).
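Before turning to the cost of transmitting a synopsis, the bloom filter membership test described above can be sketched as follows. This is a minimal illustration, not the authors' code. The class `BloomFilter`, the helper `semijoin_filter`, the use of salted SHA-256 digests as a stand-in for k independent hash functions, the choice k = 3, and the sample tuples are all assumptions. Positions are numbered from 0 to m − 1 here rather than 1 to m.

```python
import hashlib


class BloomFilter:
    """A bit array of m bits with k hash functions (salted SHA-256 as a stand-in)."""

    def __init__(self, m_bits, k):
        self.m = m_bits
        self.k = k
        self.bits = 0  # a Python int used as the bit array

    def _positions(self, value):
        # Derive k positions in [0, m) from k differently salted digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, value):
        for p in self._positions(value):
            self.bits |= 1 << p

    def might_contain(self, value):
        # A single 0 bit proves absence; all 1 bits may be a false positive.
        return all((self.bits >> p) & 1 for p in self._positions(value))


def semijoin_filter(tuples, synopsis, attr="A"):
    """Keep the tuples whose join attribute may appear in the synopsis."""
    return [t for t in tuples if synopsis.might_contain(t[attr])]


# Build the synopsis from made-up L.A values; 240 bits corresponds to 30 bytes.
l_values = [3, 17, 42]
bf = BloomFilter(m_bits=240, k=3)
for v in l_values:
    bf.add(v)

# Made-up tuples of R; the filter never drops a tuple that truly joins.
r_tuples = [{"id": 1, "A": 17}, {"id": 2, "A": 99}, {"id": 3, "A": 42}]
candidates = semijoin_filter(r_tuples, bf)
```

Because false positives are possible, `candidates` is a superset of the exact semijoin result. It always contains every tuple of R whose attribute value really occurs in L.A.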
Let m be the size of a synopsis and P(s) be the probability that a node s transmits data. Then, P(s) = 1 − Prob(node s does not transmit data). The transmission probability P() is determined by a query region and the selectivity of a selection predicate. In Fig. 4, in order for s not to transmit data, the sensor reading of node s must not satisfy the selection condition PQ and none of the children of s may transmit data to s. Therefore, P(s) can be expressed by Eq. (9).

$$
P(s) = 1 - (1 - sel(P_Q)) \prod_{c \in child(s)} (1 - P(c)) \tag{9}
$$

So, the average transmission cost for a synopsis of PA(L), TsysL, is derived as follows:

$$
T_{sysL} = m \cdot P(cov_{min}(Q_L)) \tag{10}
$$

Therefore, DR · TsysL is the synopsis transmission cost from covmin(QL ∪ QR) to covmin(QR). Then, similar to partitionJoin, a synopsis of PA(L) is propagated to the descendants of covmin(QR) to perform semijoins. This cost is derived as Eq. (11). TsysL,s and TsysL,r are the sending and receiving costs of a synopsis of PA(L), respectively. Additionally, in Fig. 8, Tr and Ts at Lines 8, 10, 15, and 17 are replaced with TsysL,r and TsysL,s, respectively, for the greedy algorithm of synopsisJoin.

$$
\begin{aligned}
\mathrm{Cost}_{partial,synopsis}(cov_{min}(Q_L), cov_{min}(Q_R)) = \min_{C \subseteq qchild(cov_{min}(Q_R))} \Big\{ \; & T_{sysL,s} + \sum_{c \in C} T_{sysL,r} + \sum_{c \in C} T(c) \cdot SJS_{L.A} \\
& + \sum_{c \in C} \mathrm{Cost}_{partial,synopsis}(cov_{min}(Q_L), c) + \sum_{c' \in qchild(cov_{min}(Q_R)) - C} \big( \mathrm{Cost}_{gathering}(c') + T(c') \big) \Big\}
\end{aligned}
\tag{11}
$$

where TsysL,s = 0 if C = ∅.

In synopsisJoin, L is sent to covmin(QL ∪ QR). In this case, since some tuples in L will not participate in the join result, it wastes the energy of sensors. To solve this problem, we devise the fullsynopsisJoin plan. In fullsynopsisJoin, covmin(QL) sends a synopsis of PA(L) instead of L itself to covmin(QL ∪ QR), and then covmin(QL ∪ QR) sends the synopsis to covmin(QR).

fullsynopsisJoin consists of four steps:

1. Generating a synopsis of PA(L) on QL.
2. Performing the synopsisJoin plan on QR.
3. Performing the synopsisJoin plan on QL with the synopsis of PA(R ⋉ L) using the second step's result.
4. Sending the semijoin results to the base station.

At the first step, when the minimum value of attribute L.A is used as a synopsis, each node in QL sends the minimum value among the data from its children and its own reading. When the bloom filter is used, a node in QL generates the bloom filter by ORing the bloom filters of its child nodes and inserting the hashed value of its own reading, since the bloom filter BF_U for a set U is equal to ∨_i BF_{U_i}, where ∪_i U_i = U. Thus, the cost of the first step, Costcon(covmin(QL)), is derived as follows:

$$
\mathrm{Cost}_{con}(cov_{min}(Q_L)) = \sum_{c \in child(cov_{min}(Q_L))} \big( \mathrm{Cost}_{con}(c) + P(c) \cdot m \big) \tag{12}
$$

At the second step, the synopsis is sent to covmin(QR) through covmin(QL ∪ QR) and synopsisJoin is applied on QR. Thus, the cost for the second step is as follows:

$$
D_L \cdot T_{sysL} + D_R \cdot T_{sysL} + \mathrm{Cost}_{partial,synopsis}(cov_{min}(Q_L), cov_{min}(Q_R)) \tag{13}
$$

As a result of the second step, the result of R ⋉ L is collected at covmin(QR). At the third step, this result is sent to covmin(QL ∪ QR). Using this result, a synopsis of PA(R ⋉ L) is made at covmin(QL ∪ QR) and sent to covmin(QL). Then, synopsisJoin is applied on QL to obtain L ⋉ R. Thus, the cost for the third step is derived as follows:

$$
D_R \cdot T(cov_{min}(Q_R)) \cdot SJS_{L.A} + D_L \cdot T_{sysR} + \mathrm{Cost}_{partial,synopsis}(cov_{min}(Q_R), cov_{min}(Q_L)) \tag{14}
$$

In the second term of Eq. (14), we use TsysR as the transmission cost for the synopsis of PA(R ⋉ L). When a set of tuples in R appears with the probability P(covmin(QR)), if the semijoin selectivity is not zero, a synopsis of PA(R ⋉ L) is not empty. Thus, TsysR is sufficient to use as the transmission cost for the synopsis of PA(R ⋉ L).

Finally, covmin(QL ∪ QR) receives the result of L ⋉ R from covmin(QL) and sends the union of R ⋉ L and L ⋉ R to the base station. This cost is DL · T(covmin(QL)) · SJSR.A + DC · (T(covmin(QL)) · SJSR.A + T(covmin(QR)) · SJSL.A).
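The transmission probability of Eq. (9), the synopsis transmission cost of Eq. (10), and the synopsis construction cost of Eq. (12) can all be evaluated bottom-up on the routing tree. The sketch below assumes, for simplicity, that every node of the subtree lies inside the query region and shares the same selection selectivity. The dictionary-based tree, the function names, and the tiny example topology are hypothetical.

```python
def transmit_prob(tree, s, sel):
    """Eq. (9): P(s) = 1 - (1 - sel) * prod over children c of (1 - P(c))."""
    p_silent = 1.0 - sel  # the node's own reading fails the selection predicate
    for c in tree.get(s, []):
        p_silent *= 1.0 - transmit_prob(tree, c, sel)  # child c sends nothing
    return 1.0 - p_silent


def synopsis_cost(tree, cover, sel, m):
    """Eq. (10): average cost of sending a size-m synopsis from the cover node."""
    return m * transmit_prob(tree, cover, sel)


def construction_cost(tree, s, sel, m):
    """Eq. (12): each child c contributes Cost_con(c) + P(c) * m."""
    return sum(
        construction_cost(tree, c, sel, m) + transmit_prob(tree, c, sel) * m
        for c in tree.get(s, [])
    )


# Hypothetical subtree: a cover node with two leaf children.
tree = {"cover": ["n1", "n2"]}
sel, m = 0.5, 30  # selectivity 0.5 and a 30-byte synopsis

p_cover = transmit_prob(tree, "cover", sel)       # 1 - 0.5 * 0.5 * 0.5 = 0.875
t_sys = synopsis_cost(tree, "cover", sel, m)      # 30 * 0.875 = 26.25
c_con = construction_cost(tree, "cover", sel, m)  # 2 * (0 + 0.5 * 30) = 30.0
```

A memoized version would avoid recomputing P(c) for deep trees. The plain recursion is kept here so that each function mirrors one equation.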
In the description of the final step, covmin(QL ∪ QR) keeps the result of R ⋉ L in order to send the union of R ⋉ L and L ⋉ R, which keeps the explanation consistent with the other plans. However, the result of R ⋉ L is actually generated at the second step and that of L ⋉ R is generated at the third step. Thus, it is possible to send each semijoin result to the base station at the end of each step without keeping the result of R ⋉ L at covmin(QL ∪ QR).

4.3. Memory requirement of in-network join plans

Generally, sensor nodes have limited memory space, so the memory requirement of each join plan should be considered. In this section, we briefly present the memory requirement of each join plan due to the space limitation.

In coverJoin, covmin(QL ∪ QR) keeps the whole L and R to perform R ⋉ L and L ⋉ R.

In sideJoin, covmin(QR) keeps R to perform R ⋉ L and covmin(QL ∪ QR) keeps L to perform L ⋉ R.

Compared to sideJoin, partitionJoin has an advantage in memory space. In partitionJoin, a sensor c in QR needs to keep Rc ⊆ R and PA(L) to perform Rc ⋉ L. Thus, when R is partitioned into a set of Rc's, partitionJoin can be performed if the sum of the size of PA(L) and the maximum size of Rc is less than the memory limit.

synopsisJoin reduces the memory requirements of sensors in QR compared to partitionJoin since a synopsis of PA(L) is used instead of PA(L).

However, in sideJoin, partitionJoin, and synopsisJoin, L should be kept in covmin(QL ∪ QR) to perform L ⋉ R. Thus, if the memory space of a sensor (i.e., covmin(QL ∪ QR)) is smaller than the size of L, sideJoin, partitionJoin, and synopsisJoin cannot be applied.

In contrast to the other plans, in fullsynopsisJoin, the whole relations L and R are not required at either of the two query regions since the synopsisJoin plan is applied on both QL and QR. Thus, the memory requirement of fullsynopsisJoin is the smallest among the devised in-network join plans.

5. Experiments

In this section, we demonstrate the effectiveness of INJECT and show the efficiency of our proposed in-network join plans. To show the effectiveness of INJECT, we implement diverse join plans: baseJoin, coverJoin, sideJoin, partitionJoin, synopsisJoin, and fullsynopsisJoin. We empirically compare the performance of the devised join plans and show that INJECT chooses the optimal or near optimal plan over diverse environments.

5.1. Experimental environment

In this section, we present the features of the experimental data set and the parameters used to configure diverse environments. The default network configuration of the experiments is a 10 × 10 grid with a sensor node placed at each grid point. The base station is located at the upper left corner. The routing tree is constructed using the FHF (First-Heard-From) network configuration algorithm [14]. The maximum hop distance of the network tree is eight.

As shown in our basic cost model in Eq. (2), the transmission cost is affected by the selection selectivity and the size of a query region. In addition, the join selectivity affects the communication cost. Thus, to simulate diverse network conditions, we use several parameters. The default parameter settings used in our experiments are summarized in Table 2.

We set the size of a tuple to 44 bytes and the size of the join attribute to 8 bytes. For synopsisJoin and fullsynopsisJoin, the bloom filter is used. We set the size of a bloom filter to 30 bytes.

In our experiments, we run the join query for an interval of 1000 epochs. We show the accumulated costs estimated using the proposed cost models and those of actual execution. Generally, the energy consumption of data transmission differs according to the sensor's type, and our cost model is based on the size of data transmission. Thus, we use the amount of transmitted data as the performance metric (see footnote 2).

5.2. Experimental results

In each experiment, we vary one of the parameters and show its effect.
With a few exceptions, the estimated cost accurately reflects the relative rank of each plan.

In previous sections, we only present the case in which PA(L) (or a synopsis of PA(L)) is sent to covmin(QR). However, except in baseJoin and coverJoin, the join column of either relation can be sent to the opposite side. For example, PA(R) can be sent to covmin(QL) in sideJoin. Thus, in our experimental results, we attach the prefixes L and R to each plan to represent these cases. The prefix L denotes that a semijoin is performed at the QL side (i.e., PA(R) is sent to covmin(QL)). The prefix R denotes the opposite case.

5.2.1. Join selectivity

To show the effect of the join selectivity, we set SJSL.A for R ⋉ L to 0.5 and vary the semijoin selectivity (SJSR.A) for L ⋉ R from 0.1 to 1.0. Fig. 9 shows the estimated costs and actual costs of the diverse plans.

As shown in Fig. 9, the pattern of the estimated costs is quite similar to that of the actual costs. INJECT estimates that LfullsynopsisJoin is superior to the other plans when SJSR.A is from 0.1 to 0.7 and that RsynopsisJoin is the best plan when SJSR.A is from 0.8 to 1. In the actual costs shown in Fig. 9(b), LfullsynopsisJoin shows the best performance when SJSR.A is from 0.1 to 0.6, and RsynopsisJoin is superior when SJSR.A is from 0.8 to 1. When SJSR.A is 0.7, RfullsynopsisJoin shows the best performance, but the performance of LfullsynopsisJoin is very close to that of RfullsynopsisJoin. This result confirms that INJECT chooses the optimal or near optimal plan among diverse in-network join plans.

Footnote 2: Since the transmission size of each node can be estimated using our cost model, other measures such as the number of data transmissions and the energy consumption can be derived easily. However, we omit them due to the space limitation.

Table 2. Parameters.
| Parameter | Default value | Comments |
|---|---|---|
| sel(PQ) | 0.5 | Selection selectivity for a region Q |
| SJS | 0.5 | Semijoin selectivity |
| DC | 1 | See Fig. 5 |
| DL | 2 | See Fig. 5 |
| DR | 2 | See Fig. 5 |
| \|QL\| | 10 | # of nodes in QL |
| \|QR\| | 10 | # of nodes in QR |

Generally, the transmission costs of the basic join plans (baseJoin, coverJoin, sideJoin) are higher than those of the enhanced join plans (partitionJoin, synopsisJoin, fullsynopsisJoin). Among the enhanced join plans, the costs of synopsisJoin and fullsynopsisJoin are much lower than those of the other plans in almost all cases.

Since, in baseJoin, all tuples of L and R satisfying the selection predicates are sent to the base station, the semijoin selectivity does not affect the performance of baseJoin.

In coverJoin, L and R are sent to covmin(QL ∪ QR) and the results of L ⋉ R and R ⋉ L are sent to the base station. Thus, since the size of L ⋉ R increases as SJSR.A increases, the cost of coverJoin increases.

Now we consider the LsideJoin and RsideJoin plans. As SJSR.A increases, the number of L's tuples participating in a semijoin result increases. In other words, the size of L ⋉ R is smaller than that of R ⋉ L when SJSR.A is smaller than SJSL.A. Therefore, when SJSR.A is smaller than 0.5, LsideJoin is better than RsideJoin since many tuples of L are filtered out by the semijoin L ⋉ R at covmin(QL). But, when SJSR.A is greater than 0.5, RsideJoin is better than LsideJoin. Similar arguments hold for LpartitionJoin and RpartitionJoin as well as LsynopsisJoin and RsynopsisJoin.

In contrast to the other join plans, LfullsynopsisJoin and RfullsynopsisJoin show similar performance since semijoins using synopses are performed at both sides. The only difference between LfullsynopsisJoin and RfullsynopsisJoin is the side at which the initial synopsis is generated. This cost is mainly affected by the selection selectivity sel(PQ) for a query region Q.
As mentioned in Sections 4.1 and 4.2, in the partitionJoin and synopsisJoin plans, a query region Q is partitioned into subregions Qc and a join column or a synopsis is distributed to the subregions. However, as mentioned above, as SJSR.A increases, the number of tuples of L which contribute to a join result increases. Thus, in each subregion, only a small number of tuples are filtered out, while the cost to distribute PA(R) or a synopsis of PA(R) is still required. Therefore, when SJSR.A is greater than 0.7, our partition algorithm determines that partitioning QL is not beneficial. In these cases, LpartitionJoin acts as LsideJoin. Compared to LpartitionJoin, LsynopsisJoin is slightly better since a synopsis of PA(R) instead of PA(R) is sent to covmin(QL). In contrast, QR is partitioned into several subregions in RpartitionJoin and RsynopsisJoin since SJSL.A is fixed at 0.5. Thus, RpartitionJoin and RsynopsisJoin show better performance than RsideJoin.

LfullsynopsisJoin and RfullsynopsisJoin show the best performance when SJSR.A is smaller than 0.7 since the partition strategy is applied to both sides and many tuples are filtered out. However, when SJSR.A is greater than 0.7, LfullsynopsisJoin and RfullsynopsisJoin are worse than RpartitionJoin and RsynopsisJoin, since L/RfullsynopsisJoin act as LsideJoin on QL although they behave like RsynopsisJoin on QR.

One interesting point is that the costs of LsideJoin, LpartitionJoin, and LsynopsisJoin are greater than that of coverJoin when SJSR.A is greater than 0.7. In particular, when SJSR.A is 1.0, the costs of LsideJoin and LpartitionJoin approach that of baseJoin. This result indicates that blindly using a fixed in-network join plan can waste as much energy as a naive join plan (i.e., baseJoin), and therefore, cost based optimization is required.
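The selection step implied by this discussion, picking the plan with the smallest estimated transmission cost for the current conditions, can be sketched in a few lines. The function name `choose_plan` and all cost figures below are hypothetical. The numbers are invented only to mimic the qualitative trend described above and are not measurements from the experiments.

```python
def choose_plan(estimated_costs):
    """Return the name of the plan with the minimum estimated transmission cost."""
    return min(estimated_costs, key=estimated_costs.get)


# Invented estimates for a low and a high semijoin selectivity setting.
low_selectivity = {
    "baseJoin": 2200, "coverJoin": 1500, "LsideJoin": 1300,
    "RsynopsisJoin": 1200, "LfullsynopsisJoin": 900,
}
high_selectivity = {
    "baseJoin": 2200, "coverJoin": 1900, "LsideJoin": 2150,
    "RsynopsisJoin": 1400, "LfullsynopsisJoin": 1600,
}

best_low = choose_plan(low_selectivity)    # "LfullsynopsisJoin"
best_high = choose_plan(high_selectivity)  # "RsynopsisJoin"
```

With these invented figures, a plan fixed to `LsideJoin` would be acceptable in the first setting but nearly as costly as `baseJoin` in the second, whereas re-evaluating the estimates switches plans.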
Fig. 9. Join selectivity results (SJSL.A = 0.5): (a) estimated cost, (b) actual cost. Both panels plot the transmission cost (×1000) against SJSR.A for baseJoin, coverJoin, LsideJoin, RsideJoin, LpartitionJoin, RpartitionJoin, LsynopsisJoin, RsynopsisJoin, LfullsynopsisJoin, and RfullsynopsisJoin.

Fig. 10. Selection selectivity results (sel(PQL) = 0.5): (a) estimated cost, (b) actual cost. Both panels plot the transmission cost (×1000) against sel(PQR) for the same plans.

Compared to the actual costs, the average of the relative error rates (= |(estimated cost − actual cost)/actual cost|) is about 5% and the maximum of the relative error rates is 18%. Therefore, the proposed cost models accurately estimate the transmission costs.

5.2.2. Selection selectivity

This experiment tests the effect of the selection selectivity. We set the selection selectivity of QL (= sel(PQL)) to 0.5 and vary the selection selectivity of QR (= sel(PQR)) from 0.1 to 1.0. Other parameters are all set to their default values.

The results are shown in Fig. 10. The patterns of the estimated costs and the actual costs are also similar. The average of the relative error rates is about 7%. In this experiment, INJECT chooses LpartitionJoin as the optimal plan when sel(PQR) is small (i.e., 0.1–0.3). As shown in Fig. 10(b), LfullsynopsisJoin shows the best performance when sel(PQR) is small. However, the performance gap between LfullsynopsisJoin and LpartitionJoin is quite small.
In addition, as expected by INJECT, RfullsynopsisJoin shows the best performance in the other cases.

As the selection selectivity increases, the transmission cost also increases. Thus, in contrast to the result for baseJoin in Fig. 9, the cost of baseJoin increases. The cost of coverJoin also increases. But, since R ⋉ L and L ⋉ R instead of R and L are sent from covmin(QL ∪ QR) to the base station in coverJoin, coverJoin is better than baseJoin.

As sel(PQR) increases, the number of tuples of R satisfying the selection predicate increases. Thus, it is beneficial to perform a semijoin at the QR side. Therefore, as shown in Fig. 10, when sel(PQR) is greater than sel(PQL) (= 0.5), a plan in which a semijoin is performed at the QR side (e.g., RsideJoin) has a smaller transmission cost than its counterpart (e.g., LsideJoin).

As mentioned in Section 5.2.1, the behaviors of LfullsynopsisJoin and RfullsynopsisJoin are similar except for the side at which the initial synopsis is constructed. As presented in Eqs. (9) and (12), the synopsis construction cost is affected by the selection selectivity. In LfullsynopsisJoin, the initial synopsis is constructed at the QR side. As sel(PQR) increases, the synopsis construction cost on QR increases. Thus, the performance of LfullsynopsisJoin becomes worse than that of RfullsynopsisJoin.

Recall that, as mentioned in Section 5.2.1, partitioning a query region is beneficial when the join selectivity is not high. In this experiment, we set SJSR.A and SJSL.A to 0.5. Therefore, as sel(PQR) increases, the performance gap between RsideJoin, which gathers the tuples satisfying a selection predicate at covmin(QR), and the join plans applying the partition strategy on QR (i.e., RpartitionJoin, RsynopsisJoin, L/RfullsynopsisJoin) increases.

5.2.3. Cover node depth

In this experiment, we vary the hop distance and show its effect. We set DC and DL to 1 and 2, respectively. We change DR from a value smaller than DL to a value larger than DL.
The results are shown in Fig. 11. Since we set the selection selectivity and the join selectivity at each side to the same values, when DR and DL are the same (= 2), the performance of a plan performing a semijoin at QR and that of its counterpart are the same. Although the size of the data transmitted by covmin(QL) and that transmitted by covmin(QR) are the same, nodes having a larger hop distance to covmin(QL ∪ QR) have a larger transmission cost. In the case of LsynopsisJoin and RsynopsisJoin, the plan with the smaller cost changes from RsynopsisJoin to LsynopsisJoin as DR increases. The other plans also show similar patterns. The average relative error rate over all cases is about 6%.

Fig. 11. Cover node depth results (DL = 2): (a) estimated cost, (b) actual cost. Both panels plot the transmission cost (×1000) against DR (1 to 4) for baseJoin, coverJoin, LsideJoin, RsideJoin, LpartitionJoin, RpartitionJoin, LsynopsisJoin, RsynopsisJoin, LfullsynopsisJoin, and RfullsynopsisJoin.

5.2.4. Memory requirement

Sensor nodes generally have limited memory space. Therefore, the memory consumption of each join plan should be considered. In this section, we show the minimum and maximum memory requirements of each in-network join plan with the default parameters. Table 3 presents the minimum and maximum sizes (in bytes) of the accumulated memory requirement of a node, with the corresponding nodes in parentheses. Lci is a child of covmin(QL).

Table 3. Memory requirement.
| Join plan | MIN data size | MAX data size |
|---|---|---|
| coverJoin | 443,344 (covmin(QL ∪ QR)) | |
| LsideJoin | 220,880 (covmin(QL ∪ QR)) | 222,464 (covmin(QL)) |
| RsideJoin | 220,880 (covmin(QR)) | 222,464 (covmin(QL ∪ QR)) |
| LpartitionJoin | 61,764 (Lc2) | 220,880 (covmin(QL ∪ QR)) |
| RpartitionJoin | 62,536 (Rc2) | 222,464 (covmin(QL ∪ QR)) |
| LsynopsisJoin | 21,018 (Lc2) | 220,880 (covmin(QL ∪ QR)) |
| RsynopsisJoin | 22,118 (Rc2) | 222,464 (covmin(QL ∪ QR)) |
| fullsynopsisJoin | 21,018 (Lc5) | 44,998 (Lc6) |

Because, in coverJoin, covmin(QL ∪ QR) keeps the whole L and R to perform R ⋉ L and L ⋉ R, coverJoin requires the largest memory space. LsideJoin and RsideJoin reduce the memory requirement because only one of the relations L and R has to be kept at each cover node. partitionJoin requires smaller memory space than sideJoin because a relation is partitioned into subregions. In the case of synopsisJoin, the required memory space is smaller than that of partitionJoin because a synopsis is used instead of the projection result. However, in partitionJoin and sideJoin, covmin(QL ∪ QR) keeps the other relation. fullsynopsisJoin reduces the required memory space much more because the partitioning strategy is applied to both query regions. Therefore, even when the memory space of a sensor is smaller than a relation, fullsynopsisJoin can be applied.

6. Related work

Joins are common in applications for target tracking, event detection, correlation analysis, and so on. In the database literature, several sensor data management systems such as Cougar [7] and TinyDB [14] have been introduced. However, these systems do not support joins efficiently. Basically, in these systems, joins are performed at the base station like the baseJoin plan.

Recently, some work on in-network join processing has been conducted. Abadi et al. propose REED [1], which is a distributed join algorithm for event detection. In REED, some conditions for event detection are specified as a static relation. REED supports joins between sensor data and a static relation built outside the networks.
Thus, they do not consider joins of sensor readings in distinct regions.

Bonfils and Bonnet [4] suggest an adaptive algorithm for finding the optimal join location. Similar work is performed in [19] for environments in which data is transmitted through a hierarchy of network nodes with progressively increasing computing power and network bandwidth. This work considers that the join is performed at only one node.

Yang et al. [22] propose the two-phase self join (TPSJ) approach. TPSJ is a kind of semantic optimization technique. In TPSJ, the base station gathers some sensor readings which act as a static relation like in REED [1]. Then, the gathered sensor readings are distributed through the sensor network and each sensor performs a join operation. In TPSJ, the distribution cost of the gathered readings is reduced if the currently gathered readings are contained in the previously distributed readings. But, in order to check the containment, the sensor readings composing the static table should be transmitted to the base station.

Some in-network join methods between two regions in sensor networks were presented in [6,17,20]. The work in [17] considers the optimal join location using a cost model. In this work, the optimal join location is near the weighted centroid of three points: the two center points of the two query regions and the base station. But the cost models used in these methods are not accurate since they do not consider join selectivities and they assume that all sensor readings in a query region are collected at the center of the region. Similar work is done in [20], where the weighted Fermat point is obtained as the optimal join location.

Coman et al. [6] presented three join plans which are similar to our three basic join plans. External join in [6] is identical to our baseJoin. Mediated join [6] is similar to our coverJoin since two relations are joined at a node.
Like [20], the weighted Fermat point is considered as the join location in mediated join. But, since our work is based on tree routing, the join location of coverJoin is covmin(QL ∪ QR). If we align the routing tree to make the weighted Fermat point covmin(QL ∪ QR), coverJoin is quite similar to mediated join. Local join [6] is similar to our sideJoin since a relation is transmitted to the counterpart. Unlike sideJoin, local join distributes a relation to all nodes in the counterpart region. Note that, in partitionJoin, a generalization of sideJoin, a relation (actually, a join column) is distributed with respect to the cost model. Thus, with the cost model of partitionJoin, we can obtain a more efficient distribution plan.

All work above except [6] focuses on a general join operator. In order to reduce the communication overhead, we consider semijoins and synopsis joins. Semijoins and synopsis joins are used widely in distributed database environments [2]. However, these methods cannot be applied directly since sensor networks are hierarchically structured.

In [23], a synopsis join algorithm for in-network join processing is proposed. In this work, a histogram is used as a synopsis. The authors suggest a synopsis join location based on a cost model. In the aspect of using synopsis joins, the work of [23] is similar to ours. But the cost model used in [23] is too rough to use in query optimization since the purpose of the cost model is finding the weighted centroid of the two regions' centers. Furthermore, this technique cannot be used in tree routing environments since the synopsis join is performed at an arbitrary node (i.e., a weighted centroid) in the network, like [17].

Stern et al. propose the SENS-Join method [21] in order to avoid shipping tuples that do not join through the network. The join plan of SENS-Join consists of two phases like [23]. In [21], SENS-Join uses a compact representation of join attribute values based on the quad tree.
Some techniques used in [21], such as gathering values of a join attribute rather than whole tuples, can be applied in our work orthogonally. However, in [21], the cost model for SENS-Join is not presented. As mentioned above, the optimal query plan changes according to various conditions. Thus, the cost model is an indispensable component to choose the best query plan. Also, they only show the efficiency of SENS-Join compared with baseJoin.

7. Conclusion

In this work, we suggest a cost based join strategy, called INJECT, for tree routing sensor networks. To do this, we suggest diverse join plans. Based on the basic cost model to gather sensor readings in a region and to transmit the gathered data to a certain node via tree routing, we devise cost models for the diverse join plans. Since we devise the cost models based on reasonable assumptions, we have confidence that our cost models can be easily extended to other join plans.

To show the performance of our devised join plans and the effectiveness of cost based query optimization in tree routing sensor networks, we implement diverse join plans and conduct an extensive experimental study over diverse conditions. In our experiments, some devised join plans show the best performance in certain cases. However, no plan is superior in all cases. Our experiments show that our proposed method, INJECT, chooses the optimal or near optimal join plan over diverse cases. Thus, INJECT extends the lifetime of sensor networks.

This work focuses on efficient in-network processing for a single join query. Thus, as future work, we will study a technique that efficiently processes multiple join queries in networks simultaneously, as well as the corresponding cost model.

Acknowledgements

We would like to thank the editor and anonymous reviewers for their helpful comments.
This work was supported in part by the National Research Foundation of Korea grant funded by the Korean government (No. 2010-0016165) and in part by the National Research Foundation of Korea grant funded by the Korean government (MEST) (No. 2011-0000377).

References

[1] D.J. Abadi, S. Madden, W. Lindner, Reed: robust, efficient filtering and event detection in sensor networks, in: Proceedings of VLDB, 2005, pp. 769–780.
[2] P.A. Bernstein, D.-M.W. Chiu, Using semi-joins to solve relational queries, J. ACM 28 (1) (1981) 25–40.
[3] B.H. Bloom, Space/time trade-offs in hash coding with allowable errors, Commun. ACM 13 (7) (1970).
[4] B.J. Bonfils, P. Bonnet, Adaptive and decentralized operator placement for in-network query processing, in: Proceedings of Information Processing in Sensor Networks, 2003, pp. 47–62.
[5] D. Chu, A. Deshpande, J.M. Hellerstein, W. Hong, Approximate data collection in sensor networks using probabilistic models, in: Proceedings of ICDE, 2006, p. 48.
[6] A. Coman, M.A. Nascimento, J. Sander, On join location in sensor networks, in: Proceedings of International Conference on Mobile Data Management (MDM), 2007, pp. 190–197.
[7] A. Demers, J. Gehrke, R. Rajaraman, N. Trigoni, Y. Yao, The cougar project: a work-in-progress report, SIGMOD Record 32 (3) (2003) 9–18.
[8] A. Deshpande, C. Guestrin, S. Madden, J.M. Hellerstein, W. Hong, Model-driven data acquisition in sensor networks, in: Proceedings of VLDB, 2004, pp. 588–599.
[9] W.R. Heinzelman, A. Chandrakasan, H. Balakrishnan, Energy-efficient communication protocol for wireless microsensor networks, in: Proceedings of Annual Hawaii International Conference on System Sciences (HICSS), 2000.
[10] A.R. Hevner, S.B. Yao, Query processing in distributed database systems, IEEE Trans. Software Eng. 5 (3) (1979) 177–187.
[11] Y. Kotidis, Snapshot queries: towards data-centric sensor networks, in: Proceedings of ICDE, 2005, pp. 131–142.
[12] S. Lindsey, C.S. Raghavendra, K.M. Sivalingam, Data gathering in sensor networks using the energy delay metric, in: Proceedings of IPDPS, 2001, p. 188.
[13] S. Madden, M.J. Franklin, J.M. Hellerstein, W. Hong, Tag: a tiny aggregation service for ad-hoc sensor networks, in: Proceedings of OSDI, 2002.
[14] S. Madden, M.J. Franklin, J.M. Hellerstein, W. Hong, Tinydb: an acquisitional query processing system for sensor networks, ACM Trans. Database Syst. 30 (1) (2005) 122–173.
[15] F. Marcelloni, M. Vecchio, Enabling energy-efficient and lossy-aware data compression in wireless sensor networks by multi-objective evolutionary optimization, Inform. Sci. 180 (10) (2010) 1924–1941.
[16] C. Ok, S. Lee, P. Mitra, S. Kumara, Distributed routing in wireless sensor networks using energy welfare metric, Inform. Sci. 180 (9) (2010) 1656–1670.
[17] A. Pandit, H. Gupta, Communication-efficient implementation of range-joins in sensor networks, in: Proceedings of DASFAA, 2006, pp. 859–869.
[18] A. Silberstein, R. Braynard, Y. Yang, Constraint chaining: on energy-efficient continuous monitoring in sensor networks, in: Proceedings of ACM SIGMOD, 2006, pp. 157–168.
[19] U. Srivastava, K. Munagala, J. Widom, Operator placement for in-network stream query processing, in: Proceedings of ACM PODS, 2005, pp. 250–258.
[20] M. Stern, Optimal locations for join processing in sensor networks, in: Proceedings of International Conference on Mobile Data Management (MDM), 2007, pp. 336–340.
[21] M. Stern, E. Buchmann, K. Böhm, Towards efficient processing of general-purpose joins in sensor networks, in: Proceedings of ICDE, 2009.
[22] X. Yang, H.B. Lim, M.T. Özsu, K.L. Tan, In-network execution of monitoring queries in sensor networks, in: Proceedings of ACM SIGMOD, 2007, pp. 521–532.
[23] H. Yu, E.-P. Lim, J. Zhang, On in-network synopsis join processing for sensor networks, in: Proceedings of the International Conference on Mobile Data Management, 2006, p. 32.