Cost based in-network join strategy in tree routing sensor

Information Sciences 181 (2011) 3443–3458
Cost based in-network join strategy in tree routing sensor networks
Jun-Ki Min a, Heejung Yang b, Chin-Wan Chung b,⇑
a School of Internet-Media Engineering, Korea University of Technology and Education, Byeongcheon-myeon, Cheonan, Chungnam 330-708, Republic of Korea
b Department of Computer Science, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 305-701, Republic of Korea
Article info
Article history:
Received 30 November 2009
Received in revised form 6 October 2010
Accepted 3 April 2011
Available online 19 April 2011
Keywords:
Query processing
Sensor network
Join
Cost model
Abstract
Tiny, smart sensors enable applications that access a network of hundreds or
thousands of sensors. In many applications, joins are used frequently to find relationships
of readings of different sensors such as the correlation of sensor readings in distinct
regions.
In this paper, we present a cost based in-network join strategy called INJECT. Since the
optimal join plan depends on various conditions such as data distributions
and join predicates, blindly using a fixed join plan wastes the energy of sensors.
Based on an analysis of how join queries can be handled in sensor networks, we devise
several join plans. In particular, since data transmission dominates the energy consumption of a sensor, we devise cost models, each of which reflects the transmission cost
of a join plan. Experimental results confirm that INJECT chooses the optimal or a near-optimal plan under various conditions.
© 2011 Elsevier Inc. All rights reserved.
1. Introduction
Wireless sensor networks (WSNs) are systems that are typically composed of a large number of sensors and the base station where a user can access data. Sensor nodes in sensor networks are severely constrained in terms of battery power.
Replacing the battery of a sensor is either too expensive or impossible. Energy conservation is therefore a major research issue
since it directly impacts the lifetime of a network.
Recent research has shown that radio communication is the most energy-expensive operation of a sensor. Thus, many techniques in diverse areas such
as the routing protocol [9,12,16], event detection [1,22], in-network aggregation [13], and approximate data gathering
[5,8,11,15] have been proposed in order to reduce the communication overhead.
In-network aggregation provides a great opportunity for reducing the communication overhead using the summary data
(e.g., SUM) and/or exemplary data (e.g., MIN and MAX). However, a single aggregated value is insufficient to analyze the
whole sensor field in some applications [5].
Thus, some data gathering techniques [5,8,11,15] in sensor networks have been proposed. Periodic reporting of sensor
readings drains the energy of sensors since it results in excessive communication. So, to reduce the communication overhead, in-network approximation techniques have been proposed. In this approach, a data model [11] or a data compression
[15] technique is applied.
In some applications, a user wants to identify the relationship between sensor readings in different regions. For example,
a climatologist wants to analyze the correlation between the rainfall of one region and the temperature of another region. This regional correlation can be expressed as a join query of sensor readings in two regions.
⇑ Corresponding author. Tel.: +82 42 869 3537; fax: +82 42 869 3577.
doi:10.1016/j.ins.2011.04.017
A naive plan to answer a join query is to gather sensor readings of two regions at the base station and to perform a join
operation at the base station. This approach may waste much energy of sensors since sensor readings which will not participate in the join results have to be transmitted to the base station.
An alternative plan is to perform a join in sensor networks in order to filter out unrelated sensor readings as soon as possible. However, using a fixed in-network join plan blindly may waste the sensor energy as much as the naive plan does. Thus,
in this paper, we propose INJECT which is an In-Network Join stratEgy using Cost based optimization in Tree routing sensor
networks. Tree based routing has been proposed as an energy efficient mechanism to transmit sensor readings. Due to its
simplicity and manageability, many works for sensor networks are based on tree routing where a message from a node is
passed to another node through a routing tree. Our work is also based on tree routing.
Join processing has a long and rich history in the database field. However, due to the hierarchical structure of sensor networks, traditional techniques are not directly applicable. Thus, recently, research on in-network join processing [17,20–23]
has been proposed to reduce the communication overhead. In in-network join processing, a node generates join results.
However, some techniques just propose join plans without proper cost models. Also, although other work suggests cost
models for in-network joins, the proposed cost models are too rough to be used in the query optimization. In addition, some
in-network join plans are based on location based routing protocols such as GPSR. Thus, these plans cannot be applied to
tree-based routing environments.
To the best of our knowledge, our work is the first for the cost-based optimization in tree routing sensor networks. Based
on our cost model, accurate cost models for diverse in-network join techniques including the future techniques can be
devised.
Our contribution. Our work focuses on an efficient in-network join processing with a low transmission cost in order to
conserve the energy of sensors. Let sets of sensor readings in a region QL and a region QR be L and R, respectively, and a join
predicate of L and R be PJOIN(L, R). Then a join query is defined as L ⋈_{PJOIN(L, R)} R. In the result of the join, an element in L may be associated with many elements in R, and vice versa. INJECT considers semijoin operators as in-network join operators. Since a
semijoin generates only the tuples which will participate in join results, the unrelated tuples can be filtered out. Thus, the network communication overhead is reduced. In INJECT, we assume that the actual join operation is performed at the base station, which has
unlimited power, using the results of semijoins (i.e., the results of L ⋉ R and R ⋉ L).
INJECT has the following combination of contributions to perform an in-network join in an energy efficient manner.
• We propose an in-network join framework using cost based optimization to identify an efficient join plan on tree based routing networks.
• For INJECT, we devise diverse join plans in sensor network environments. First, we suggest three basic join plans: baseJoin, coverJoin, and sideJoin. Then, by analyzing the hierarchical structure of tree routing networks, we devise partitionJoin. In addition, we devise synopsisJoin by combining the synopsis technique and partitionJoin.
• We devise accurate cost models for diverse join plans in tree routing. For cost based optimization, an accurate cost model is an indispensable component. In our work, we make a basic cost model for gathering sensor readings in a query region and sending the gathered readings to a node in the tree based routing environments. Based on the basic cost model, we develop the cost models for the devised join plans.
• We provide an extensive experimental study of our framework in diverse environments. Our experimental results show that INJECT provides accurate cost models and therefore the most efficient query plan is selected.
Organization of the paper. In the remainder of the paper, we present details of INJECT. In Section 2, we present the characteristics of our join operation. Section 3 describes three basic join plans and their cost models. In Section 4, we present
enhanced join plans that we devise. Section 5 contains the experimental results. Section 6 outlines related works. Finally,
in Section 7, we summarize our work and suggest some future studies.
2. Preliminaries
We first describe some assumptions used in the paper and formalize the in-network join processing problem.
Let us consider a set of sensor nodes S = {s1, s2, . . . , sn} located at positions {(x1, y1), . . . , (xn, yn)}, respectively. Sensor
readings can be collected at the base station based on a tree routing protocol. Each sensor keeps its hop distance, which
is the number of hops from the base station to it. Two nodes capable of bi-directional wireless communication are
referred to as neighbors. Also, the base station knows the locations of sensors and the tree routing hierarchy among
sensors.
In INJECT, a join query is submitted to the base station. The base station generates diverse join plans which will be presented in Sections 3 and 4. Then, the base station identifies the optimal join plan among diverse join plans using their cost
models.
We analyze the join processing problem in sensor networks for a join query between relations. For simplicity of presentation, we assume two relations. In practice, two-way joins are the most frequent, and they are also the basis of m-way
joins. A relation can be a set of sensor readings in a region as shown in Fig. 1. There are many motivating examples of
in-network joins in the related literature. Here, we present an SQL form for a join.
[Figure: a routing tree rooted at the base station, with query regions QL and QR]
Fig. 1. Routing tree and join scenario.
SELECT L.*, R.*
FROM Sensor L, Sensor R
WHERE L.location in QL
AND R.location in QR
AND PL(L)
AND PR(R)
AND PJOIN(L, R)
{SAMPLE INTERVAL X} {FOR D}
In the above SQL statement, relations L and R are the sets of sensor readings restricted by regions QL and QR, respectively.
Also, PL and PR are selection predicates for the relations L and R, respectively. If there is no selection predicate for a relation,
all tuples in a relation (i.e., all sensor readings in QL (or QR)) are sent to the base station. Additionally, PJOIN(L, R) is a join predicate for the relations L and R. Since the sensor readings are continuously generated, the above SQL statement is continuously
executed. For this, we use TinyDB syntax [14]: SAMPLE INTERVAL and FOR. The query is executed once per X seconds for a
period of D seconds. The SAMPLE INTERVAL and FOR terms are optional. The default X for SAMPLE INTERVAL is 1 s and D for
FOR is 1.
The problem that we intend to solve is formalized as follows:
Problem definition. Given a sensor network consisting of a set of sensor nodes S = {s1, s2, . . . , sn}, tree routing is used to disseminate a query and collect sensor readings. Let sets of sensor readings in a region QL and a region QR be L and R, respectively,
and the join predicate of L and R be PJOIN(L, R). Find the most effective plan to process a join query defined as L ⋈_{PJOIN(L, R)} R.
In order to estimate a query cost, the selectivities of selection predicate PL and PR as well as join selectivity of join predicate PJOIN(L, R) are required. To obtain these statistics, the base station executes an in-network join plan, baseJoin (described
in Section 3), and collects required statistics from join results in learning phases. Also, the query optimizer recomputes the
costs of query plans periodically using the history of the previous query results, and the base station broadcasts a more efficient plan if one exists. We omit the details of this since our work focuses on the diverse join plans and their accurate cost
models.
3. In-network join processing
In this section, we present three basic join plans considered in INJECT and suggest their cost models based on tree routing.
Based on the basic join plans, some extended join plans will be presented in Section 4.
The proposed join plans are summarized in Table 1.
3.1. Basic join plans
In order to explain the basic join plans and their cost models, we define some basic concepts.
Definition 1. The sensing region of a sensor s, SR(s) is a minimum bounding region that covers the locations of s’s
descendants and itself in a routing tree. The descendants of a sensor s are all the nodes that are in its subtree.
Definition 2. The covering node set of a region Q, cov(Q) is a set of nodes each of which has SR() containing the given region
Q.
Table 1
Summary of join plans.

| Join plan | Intuition |
|---|---|
| baseJoin | A join is performed at the base station |
| coverJoin | Semijoins are performed at the node that can obtain all sensor readings of two regions |
| sideJoin | Sensor readings of a region are sent to a node in the other side to perform a semijoin |
| partitionJoin | Sensor readings of a region are distributed into nodes in the other side to perform semijoins |
| synopsisJoin | Similar to partitionJoin except sending a synopsis instead of sensor readings |
| fullsynopsisJoin | synopsisJoin is applied to both query regions |
[Figure: a routing tree of sensors s1 to s8, the sensing region SR(s2), and a query region Q]
Fig. 2. Sensing region SR().
Definition 3. The minimal cover node of a region Q, covmin(Q) is a node whose SR() covers the given region and is minimal.
By Definitions 2 and 3, the node covmin(Q) for a region Q is an element in cov(Q). Also, since our data transmission is based
on tree routing, covmin(Q) has the maximum hop distance from the base station among elements in cov(Q).
For example, as shown in Fig. 2, for a given query region Q presented as a solid box, the sensing region of s2 (=SR(s2)) covers Q. Also, SR(s1) covers Q. Thus, cov(Q) = {s1, s2}. Note that s1 and s2 can collect all readings of sensors in Q. Since SR(s2) is
smaller than SR(s1), covmin(Q) = s2. We use a rectangle to illustrate a region for easy understanding. However, any polygon can
be used.
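Definitions 1 to 3 can be illustrated with a minimal Python sketch. It assumes axis-aligned bounding boxes as sensing regions and a small hypothetical tree; the class `Node`, the helper names, the coordinates, and the rule of picking the covering node with the smallest box are assumptions made for illustration and do not reproduce the layout of Fig. 2.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    x: float
    y: float
    children: list = field(default_factory=list)

def sensing_region(node):
    """Bounding box (xmin, ymin, xmax, ymax) of a node and all its descendants."""
    xmin = xmax = node.x
    ymin = ymax = node.y
    for child in node.children:
        cx0, cy0, cx1, cy1 = sensing_region(child)
        xmin, ymin = min(xmin, cx0), min(ymin, cy0)
        xmax, ymax = max(xmax, cx1), max(ymax, cy1)
    return (xmin, ymin, xmax, ymax)

def covers(box, region):
    """True if `box` contains `region` (both are (xmin, ymin, xmax, ymax))."""
    return (box[0] <= region[0] and box[1] <= region[1]
            and box[2] >= region[2] and box[3] >= region[3])

def cov(root, region):
    """Covering node set: nodes whose sensing region contains `region`."""
    result = []
    if covers(sensing_region(root), region):
        result.append(root)
        # A child's sensing region lies inside its parent's, so only
        # subtrees of covering nodes need to be explored.
        for child in root.children:
            result.extend(cov(child, region))
    return result

def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

def cov_min(root, region):
    """Covering node with the smallest sensing region, or None if none covers."""
    return min(cov(root, region), key=lambda n: area(sensing_region(n)), default=None)

# Hypothetical tree: s1 is the root; s2 and s3 are its children.
s4, s5, s6 = Node("s4", 1, 3), Node("s5", 4, 2), Node("s6", 9, 3)
s2 = Node("s2", 3, 6, [s4, s5])
s3 = Node("s3", 8, 6, [s6])
s1 = Node("s1", 5, 9, [s2, s3])
Q = (1, 2, 4, 3)  # query region as (xmin, ymin, xmax, ymax)

print([n.name for n in cov(s1, Q)])  # covering node set of Q
print(cov_min(s1, Q).name)           # minimal cover node of Q
```

With these coordinates, both s1 and s2 cover Q and s2 is the minimal cover node, which mirrors the example in the text.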
Three basic join plans are plotted in Fig. 3. Roughly speaking, join plans are classified by the join node where a semijoin
operation is performed.
As mentioned earlier, a naive way to answer a join query is to gather all sensor readings at the base station and to perform
a join at the base station. We call this plan baseJoin.
Another join plan is that an intermediate node in the tree performs semijoins. An intermediate node can collect the sensor readings obtained from sensors in QL and QR in order to perform semijoins. Thus, the intermediate node (i.e., the join node) is an ancestor node of all sensors in QL and QR, and the join node is covmin(QL ∪ QR). We call this plan coverJoin.
In coverJoin, we do not consider the other nodes in cov(QL ∪ QR) as a join node. Suppose that a semijoin operation at a node si ∈ cov(QL ∪ QR), which is not covmin(QL ∪ QR), is an efficient execution plan. This means that the size of the semijoin result is smaller than that of all sensor readings. By Definition 3, a semijoin operation can also be performed at covmin(QL ∪ QR), which is farther from the base station than si. Thus, it is more efficient to send the semijoin results from covmin(QL ∪ QR) than from si.
The final basic join plan is sideJoin. In sideJoin, as shown in Fig. 3(c), the join column of sensor readings in QL is sent to the
other side where a semijoin operation is performed. Thus, a semijoin operation is performed at a node which is an ancestor
node of sensors in QR. Therefore, a semijoin is performed at covmin(QR).
Fig. 3. Basic join plans.
Among above three join plans, coverJoin and sideJoin are categorized into the in-network join approach since semijoins are
performed at nodes in a network. In general, if the join selectivity is very high, the in-network join approach is useless. In
contrast, if the join selectivity is very low, an in-network join approach is beneficial since the size of the join result is much
smaller than that of sensor readings participating in a join operation. Therefore, by using the cost model, the optimal join
plan should be selected.
In our work, we only consider semijoin operators for in-network join processing. Note that, since the base station can
get all the tuples (= L ⋉ R and R ⋉ L) participating in the join result, the join result can be computed at the base station without loss of information.
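The claim that the two semijoin results suffice can be checked with a small Python sketch. The tuples, the equality predicate, and the function names are hypothetical; the sketch only shows that joining L ⋉ R with R ⋉ L yields the same result as joining L with R.

```python
def semijoin(left, right, pred):
    """Tuples of `left` that match at least one tuple of `right`."""
    return [l for l in left if any(pred(l, r) for r in right)]

def join(left, right, pred):
    """Plain nested-loop join, as the base station could compute it."""
    return [(l, r) for l in left for r in right if pred(l, r)]

# Hypothetical readings: (sensor id, value of the join attribute A).
L = [("l1", 20), ("l2", 25), ("l3", 31)]
R = [("r1", 25), ("r2", 31), ("r3", 40)]

def p_join(l, r):
    return l[1] == r[1]

L_semi = semijoin(L, R, p_join)                        # L ⋉ R
R_semi = semijoin(R, L, lambda r, l: p_join(l, r))     # R ⋉ L

# Joining the two semijoin results gives the full join result.
print(join(L_semi, R_semi, p_join) == join(L, R, p_join))
```

Only two of the three tuples on each side need to reach the base station in this example, which is the saving that the in-network plans try to exploit.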
Consequently, in coverJoin, two semijoin operations (i.e., L ⋉ R and R ⋉ L) are performed at covmin(QL ∪ QR) and the two semijoin results are sent to the base station.
In sideJoin, R ⋉ L is performed at covmin(QR). To do this, sensor readings in QL are sent to covmin(QL ∪ QR). Then the join column, not the whole sensor readings, is sent from covmin(QL ∪ QR) to covmin(QR). Therefore, R ⋉ L is performed at covmin(QR) and then L ⋉ (R ⋉ L) is performed at covmin(QL ∪ QR) using the semijoin result (= R ⋉ L) obtained from covmin(QR).
Since similar arguments hold when the join column of R is sent to covmin(QL), we only present the case that the join column of a relation L is sent to covmin(QR) in this paper.
3.2. Cost model
In this section, we present cost models for INJECT. To estimate costs, various statistical estimations can be included. It will
incur additional cost to maintain required statistics incrementally. Thus, in our cost model, we use simple but reasonable
statistical estimation. However, techniques for statistical estimations can be orthogonally applied.
As mentioned in Section 2, the base station identifies the optimal join plan in INJECT. Energy is consumed in order to disseminate the optimal
plan. Since the join plans should be propagated to all participating nodes in query regions, the dissemination costs of join plans are similar. In addition, the plan dissemination cost is quite small compared to the join processing cost since the join processing is performed for a period D. Therefore, we only consider the join processing cost.
Since the transmission sizes (i.e., sizes of data to be joined) can be obtained using our model, the computation cost can be
derived. However, it is a well known fact that the computing cost is negligible compared to the transmission cost in sensor
network environments. In the case of the Berkeley sensor motes, transmitting a single bit is equivalent to 800 instructions in
terms of power consumption [13]. Thus, like related literature, we omit the computation cost. Furthermore, it is really hard
to reflect all aspects to a cost model. Thus, an abstraction is required. For instance, in the query optimization in traditional
databases, the disk I/O time is mainly considered although the computing time exists. In addition, the computing cost is
mainly determined by the size of data to be joined. Thus, the total computing costs of in-network join plans are similar.
Therefore, considering the communication overhead only is sufficient to choose an effective join plan.
In addition, for simplicity, we do not consider a link failure. By retransmission, the link failure can be solved. Thus, using
the retransmission probability, our cost model can be extended in a straightforward manner. These assumptions allow us to
make a concise cost model.
As shown in Fig. 4, to transmit sensor readings in the region Q, sensor readings are gathered at the node s(=covmin(Q))
through the tree based routing and then gathered sensor readings are sent to the node d.
In addition, as mentioned earlier, the selection operation PQ is applied to sensor readings of the region Q. Let the selectivity of PQ be sel(PQ) which is the proportion of sensor readings that satisfy the selection operation. Note that sel(PQ) of a
node which is not in the region Q is zero.
Let the size of a sensor reading be r. Then, when the node s transmits a message, the transmission cost to a neighbor, T(s)
(i.e., the average size of a message) can be expressed by Eq. (1).
[Figure: a query region Q gathered at node s, which sends the gathered readings to node d]
Fig. 4. Data transmission with a query region.
[Figure: hop distances DL, DR and DC between covmin(QL), covmin(QR), covmin(QL ∪ QR) and the base station]
Fig. 5. Hop distances.
$$T(s) = r \cdot \mathrm{sel}(P_Q) + \sum_{c \in \mathrm{child}(s)} T(c) \tag{1}$$
Using the transmission cost T(), we can compute the cost to send all readings in region Q to the node d. As mentioned earlier,
the node s(=covmin(Q)) gathers the sensor readings in Q and s transmits the gathered readings to d. Thus, the cost is computed
as follows:
$$\mathrm{Cost}(Q, d) = \mathrm{Cost}_{\mathrm{gathering}}(s) + \mathrm{Cost}_{\mathrm{sending}}(s, d) \tag{2}$$
To gather the sensor readings at the node s, the children of s gather sensor readings from their descendants and send gathered readings to s. Thus, Costgathering(s) is expressed recursively as follows:
$$\mathrm{Cost}_{\mathrm{gathering}}(s) = \sum_{c \in \mathrm{child}(s)} \left( \mathrm{Cost}_{\mathrm{gathering}}(c) + T(c) \right)$$
When the node s transmits data to its ancestors, the transmission costs T() of s's ancestors are equal to that of s. Thus, the
transmission cost from s to d is expressed as follows, where hopDiff(s, d) denotes the difference of hop distances between
s and d:
$$\mathrm{Cost}_{\mathrm{sending}}(s, d) = \mathrm{hopDiff}(s, d) \cdot T(s)$$
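The basic cost model of Eqs. (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the class `Node`, the per-node selectivity field, and all numeric values are hypothetical, and link failures are ignored as in the text.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    selectivity: float = 0.0          # sel(P_Q); zero for nodes outside Q
    children: list = field(default_factory=list)

def T(s, r):
    """Eq. (1): average message size sent by s to its parent."""
    return r * s.selectivity + sum(T(c, r) for c in s.children)

def cost_gathering(s, r):
    """Cost of gathering the readings of s's subtree at s."""
    return sum(cost_gathering(c, r) + T(c, r) for c in s.children)

def cost_sending(s, hop_diff, r):
    """Cost of relaying the gathered readings over hop_diff hops."""
    return hop_diff * T(s, r)

def cost(s, hop_diff, r):
    """Eq. (2): gathering at s = cov_min(Q) plus sending from s to d."""
    return cost_gathering(s, r) + cost_sending(s, hop_diff, r)

# Hypothetical subtree: every node lies in Q and has selectivity 0.5.
a1 = Node("a1", 0.5)
a = Node("a", 0.5, [a1])
b = Node("b", 0.5)
s = Node("s", 0.5, [a, b])

r = 4.0  # assumed size of one sensor reading
print(T(s, r), cost_gathering(s, r), cost(s, hop_diff=3, r=r))
```

In this example T(s) = 8, the gathering cost is 8, and sending over three hops adds 24, so Cost(Q, d) = 32.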
Based on Eq. (2), we can derive the cost models of three basic join plans. Suppose that, as shown in Fig. 5, the difference of hop distances from covmin(QL) to covmin(QL ∪ QR), that from covmin(QR) to covmin(QL ∪ QR), and that from covmin(QL ∪ QR) to the base station are DL, DR and DC, respectively.
In baseJoin, the sensor readings in QL and QR are gathered at covmin(QL) and covmin(QR), respectively. The two gathered data sets move to covmin(QL ∪ QR). The node covmin(QL ∪ QR) sends the data sets that came from the two nodes to the base station. Thus, the cost of baseJoin, Costbase(), can be derived as follows:
$$\mathrm{Cost}_{\mathrm{base}}(Q_L, Q_R) = \mathrm{Cost}_{\mathrm{gathering}}(\mathrm{cov}_{\min}(Q_L)) + \mathrm{Cost}_{\mathrm{gathering}}(\mathrm{cov}_{\min}(Q_R)) + D_L \cdot T(\mathrm{cov}_{\min}(Q_L)) + D_R \cdot T(\mathrm{cov}_{\min}(Q_R)) + D_C \cdot \left( T(\mathrm{cov}_{\min}(Q_L)) + T(\mathrm{cov}_{\min}(Q_R)) \right) \tag{3}$$
In coverJoin, as in baseJoin, the sensors in QL and QR send data to covmin(QL) and covmin(QR), and the gathered data is sent to covmin(QL ∪ QR). However, in contrast to the baseJoin plan, as shown in Fig. 3(b), semijoin operations are performed at covmin(QL ∪ QR).
In order to compute the transmission cost from covmin(QL ∪ QR) to the base station, we compute the size of L ⋉ R and the size of R ⋉ L. The join selectivity of a semijoin of L by R gives the fraction of tuples of L which join with tuples of R. Accurate estimation of the join selectivity is important since the choice of an effective join plan is based on it. It is difficult to estimate the join selectivity accurately, but an approximation for the semijoin selectivity was suggested in [10] as follows:
$$\mathrm{SJS}(R \ltimes_A L) = \frac{\mathrm{size}(\Pi_A(L))}{\mathrm{size}(\mathrm{dom}(A))}, \quad \text{where } \mathrm{dom}(A) \text{ is the domain of attribute } A. \tag{4}$$
In the above equation, the semijoin selectivity SJS(R ⋉A L) is only affected by the size of the join column A of L. Thus, when the size of the gathered sensor readings in QR at a certain time is T(covmin(QR)), the size of the semijoin result is T(covmin(QR)) · SJS(R ⋉A L). We use SJSL.A to denote SJS(R ⋉A L) concisely. Using the semijoin selectivities SJSR.A and SJSL.A, where the join attributes of R and L are A, we can obtain the size of data generated at covmin(QL ∪ QR).
The cost of coverJoin, Costcover(), is estimated as follows:
$$\mathrm{Cost}_{\mathrm{cover}}(Q_L, Q_R) = \mathrm{Cost}_{\mathrm{gathering}}(\mathrm{cov}_{\min}(Q_L)) + \mathrm{Cost}_{\mathrm{gathering}}(\mathrm{cov}_{\min}(Q_R)) + D_L \cdot T(\mathrm{cov}_{\min}(Q_L)) + D_R \cdot T(\mathrm{cov}_{\min}(Q_R)) + D_C \cdot \left( T(\mathrm{cov}_{\min}(Q_L)) \cdot \mathrm{SJS}_{R.A} + T(\mathrm{cov}_{\min}(Q_R)) \cdot \mathrm{SJS}_{L.A} \right) \tag{5}$$
In the sideJoin plan, like the other join plans, all sensor readings are gathered at covmin(QL) and covmin(QR). But, unlike the other join plans, the join column A of L is sent to covmin(QR) in order to perform R ⋉ L.
Since the gathered data is sent from covmin(QL) to covmin(QL ∪ QR), the transmission cost DL · T(covmin(QL)) is required. Then, the projection result of the gathered sensor readings on the join attribute is sent to covmin(QR). In order to make the cost model concise, following the general convention, we simply assume that the cardinality of a projected result is equal to the cardinality of the original relation. Therefore, the transmission cost is reduced to T(covmin(QL)) · j/r, where j is the size of a join column and r is the size of a sensor reading. So, the transmission cost of a join column of L to covmin(QR) (= DR · T(covmin(QL)) · j/r) is required in sideJoin.
As shown in Fig. 3(c), the semijoin R ⋉ L is performed at covmin(QR) and the semijoin result is sent back to covmin(QL ∪ QR). As in the cost of coverJoin (Eq. (5)), we can obtain the size of the semijoin result using the join selectivity. The size of the semijoin R ⋉ L is T(covmin(QR)) · SJSL.A. Thus, the transmission cost of the semijoin result from covmin(QR) to covmin(QL ∪ QR) is DR · T(covmin(QR)) · SJSL.A.
Finally, at the node covmin(QL ∪ QR), the semijoin L ⋉ R (= L ⋉ (R ⋉ L)) is performed and the two semijoin results are sent to the base station. This transmission cost is DC · (T(covmin(QL)) · SJSR.A + T(covmin(QR)) · SJSL.A). Therefore, the cost of sideJoin, Costside(), is derived as follows:
$$\mathrm{Cost}_{\mathrm{side}}(Q_L, Q_R) = \mathrm{Cost}_{\mathrm{gathering}}(\mathrm{cov}_{\min}(Q_L)) + \mathrm{Cost}_{\mathrm{gathering}}(\mathrm{cov}_{\min}(Q_R)) + D_L \cdot T(\mathrm{cov}_{\min}(Q_L)) + D_R \cdot T(\mathrm{cov}_{\min}(Q_L)) \cdot \frac{j}{r} + D_R \cdot T(\mathrm{cov}_{\min}(Q_R)) \cdot \mathrm{SJS}_{L.A} + D_C \cdot \left( T(\mathrm{cov}_{\min}(Q_L)) \cdot \mathrm{SJS}_{R.A} + T(\mathrm{cov}_{\min}(Q_R)) \cdot \mathrm{SJS}_{L.A} \right) \tag{6}$$
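To show how an optimizer might compare the three basic plans, the following Python sketch evaluates Eqs. (3), (5) and (6) for one set of statistics and picks the cheapest plan. The function names, the `stats` dictionary and all numbers are hypothetical and chosen only so that the arithmetic is easy to follow; they are not measurements from the paper.

```python
def cost_base(g_l, g_r, t_l, t_r, d_l, d_r, d_c):
    """Eq. (3): all readings travel to the base station."""
    return g_l + g_r + d_l * t_l + d_r * t_r + d_c * (t_l + t_r)

def cost_cover(g_l, g_r, t_l, t_r, d_l, d_r, d_c, sjs_l, sjs_r):
    """Eq. (5): semijoins at cov_min(QL ∪ QR)."""
    return (g_l + g_r + d_l * t_l + d_r * t_r
            + d_c * (t_l * sjs_r + t_r * sjs_l))

def cost_side(g_l, g_r, t_l, t_r, d_l, d_r, d_c, sjs_l, sjs_r, j, r):
    """Eq. (6): the join column of L is shipped to cov_min(QR)."""
    return (g_l + g_r + d_l * t_l + d_r * t_l * (j / r)
            + d_r * t_r * sjs_l + d_c * (t_l * sjs_r + t_r * sjs_l))

# Assumed statistics: gathering costs (g), message sizes T (t),
# hop differences (d) and semijoin selectivities (sjs).
stats = dict(g_l=100.0, g_r=100.0, t_l=40.0, t_r=80.0, d_l=2, d_r=4, d_c=5)
sjs = dict(sjs_l=0.25, sjs_r=0.5)   # SJS_{L.A} and SJS_{R.A}

costs = {
    "baseJoin": cost_base(**stats),
    "coverJoin": cost_cover(**stats, **sjs),
    "sideJoin": cost_side(**stats, **sjs, j=2.0, r=8.0),
}
best_plan = min(costs, key=costs.get)
print(costs, best_plan)
```

With these numbers the costs are 1200, 800 and 600, so sideJoin is chosen. Raising the selectivities toward 1 makes the in-network plans lose their advantage, in line with the earlier remark about very high join selectivity.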
4. Enhanced join plans
In this section, we present other join plans called partitionJoin and synopsisJoin exploiting hierarchical structures (i.e., tree
routing). Even though semijoin and synopsis join approaches are well known in distributed databases (i.e., a flat structure),
our work is more general than the previous join approaches for distributed databases. In addition, to choose optimal join
locations for partitionJoin and synopsisJoin, we devise a recursive expression for dynamic programming and its greedy
version.
4.1. Partition join
In coverJoin and sideJoin plans, semijoins are performed at one or two nodes. However, in the partitionJoin plan, semijoin
operations are performed at several nodes.
The basic intuition of partitionJoin is that ∪c (Rc ⋉ L) = R ⋉ L, where ∪c Rc = R. For general join operators, the intuition also
holds.
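The identity behind partitionJoin can be checked on a toy example. In this Python sketch the three partitions, the join values and the helper name are hypothetical, and an equi-join on a single attribute is assumed.

```python
def semijoin_by_keys(partition, l_keys):
    """Tuples of a partition of R whose join value appears in the join column of L."""
    return [t for t in partition if t[1] in l_keys]

l_keys = {20, 31}                      # assumed join column of L
R1 = [("s1", 20), ("s2", 22)]          # readings of subregion Q_R1
R2 = [("s3", 31)]                      # readings of subregion Q_R2
R3 = [("s4", 19)]                      # readings of subregion Q_R3
R = R1 + R2 + R3

# Semijoin each partition separately, then take the union.
per_partition = (semijoin_by_keys(R1, l_keys)
                 + semijoin_by_keys(R2, l_keys)
                 + semijoin_by_keys(R3, l_keys))

print(per_partition == semijoin_by_keys(R, l_keys))
```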
As shown in Fig. 6, a query region QR can be partitioned into several subregions. Each subregion can be obtained using the
following definition.
Definition 4. A subregion of a query region QR is an element in a set {QR1, QR2, . . . , QRn} such that, for each c ∈ child(covmin(QR)), QRc = SR(c) ∩ QR, where QRc ≠ ∅.
An example of partitionJoin is described in Fig. 7. Since the node covmin(QR) gathers sensor readings from QRc, the semijoin
Rc ⋉ L can be performed at c or covmin(QR).
Thus, the problem of partitionJoin is to choose the set of child nodes which perform semijoin operations. This problem is
recursively applied to the descendants of covmin(QR) since partitionJoin can be applied to a subregion QRc. For example, as
shown in Fig. 7, the nodes s1 and s2 receive ΠA(L) and return R1 ⋉ L and R2 ⋉ L, respectively. But, s3 just sends R3. Thus, R3
⋉ L is performed at covmin(QR). Note that partitionJoin can be recursively applied to children of s1 and s2.
[Figure: the query region QR under covmin(QR), partitioned into subregions QR1, QR2 and QR3]
Fig. 6. Partitions of QR.
Fig. 7. An example of partitionJoin.
Let Costpartition(QL, QR) be the cost of the partitionJoin plan. Like sideJoin, the join column of L is sent to covmin(QR) and the semijoin result of R ⋉ L is transmitted to covmin(QL ∪ QR) in partitionJoin. However, in contrast to sideJoin, the join column of L moves down to child nodes of covmin(QR). Some child nodes whose SR()s cover QRi generate the result of Ri ⋉ L, but the other child nodes simply return their gathered readings. Thus, the cost Costpartition(QL, QR) is derived as follows:
$$\mathrm{Cost}_{\mathrm{partition}}(Q_L, Q_R) = \mathrm{Cost}_{\mathrm{gathering}}(\mathrm{cov}_{\min}(Q_L)) + \mathrm{Cost}_{\mathrm{partial}}(\mathrm{cov}_{\min}(Q_L), \mathrm{cov}_{\min}(Q_R)) + D_L \cdot T(\mathrm{cov}_{\min}(Q_L)) + D_R \cdot T(\mathrm{cov}_{\min}(Q_L)) \cdot \frac{j}{r} + D_R \cdot T(\mathrm{cov}_{\min}(Q_R)) \cdot \mathrm{SJS}_{L.A} + D_C \cdot \left( T(\mathrm{cov}_{\min}(Q_L)) \cdot \mathrm{SJS}_{R.A} + T(\mathrm{cov}_{\min}(Q_R)) \cdot \mathrm{SJS}_{L.A} \right) \tag{7}$$
The term Costpartial() is used in Costpartition(QL,QR) instead of Cost gathering(covmin(QR)) in Costside(QL, QR).
When a node sends a message to another node, the receiving node consumes energy to receive the message. The receiving
cost is generally proportional to the sending cost.¹ Here, we use Ts() and Tr() in Costpartial() instead of T() since, by broadcasting
from a parent, several children receive a message in WSN environments.
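The term Costpartial(covmin(QL), covmin(QR)) is as follows:
$$\mathrm{Cost}_{\mathrm{partial}}(\mathrm{cov}_{\min}(Q_L), \mathrm{cov}_{\min}(Q_R)) = \min_{C \subseteq \mathrm{qchild}(\mathrm{cov}_{\min}(Q_R))} \Big\{ T_s(\mathrm{cov}_{\min}(Q_L)) \cdot \frac{j}{r} + \sum_{c \in C} T_r(\mathrm{cov}_{\min}(Q_L)) \cdot \frac{j}{r} + \sum_{c \in C} T(c) \cdot \mathrm{SJS}_{L.A} + \sum_{c \in C} \mathrm{Cost}_{\mathrm{partial}}(\mathrm{cov}_{\min}(Q_L), c) + \sum_{c' \in \mathrm{qchild}(\mathrm{cov}_{\min}(Q_R)) - C} \left( \mathrm{Cost}_{\mathrm{gathering}}(c') + T(c') \right) \Big\} \tag{8}$$
where qchild(covmin(QR)) is the set of children whose SR() overlap with QR, and Ts(covmin(QL)) = 0 if C = ∅.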
In Costpartial(), a subset C of qchild(covmin(QR)) is the set of nodes which generate the results of semijoins. To do this, ΠA(L) is sent to C. Its cost is Ts(covmin(QL)) · j/r. Then, all nodes in C receive ΠA(L). The cost is Σc∈C Tr(covmin(QL)) · j/r.
The transmission cost of Rc ⋉ L from c ∈ C is T(c) · SJSL.A. Since partitionJoin is applied recursively to c, Costpartial(covmin(QL), c) represents this recursion. Also, since a child c′, which does not perform a semijoin, simply gathers sensor readings of its descendants and transmits them to its parent, the term (Costgathering(c′) + T(c′)) is used.
If C is empty, Ts(covmin(QL)) = 0 since ΠA(L) does not need to be transmitted to the children. Note that, if it is beneficial that
covmin(QR) does not transmit ΠA(L) to child nodes, Costpartition() is equal to Costside(). Therefore, sideJoin is a specific plan of
partitionJoin.
In the optimal plan obtained from the above recursive expression in Eq. (8), each child has the optimal plan. In other
words, the principle of optimality holds. Thus, dynamic programming can be applied to find the optimal join plan. The time
complexity of dynamic programming is O(2^n), where n is the number of descendant nodes of covmin(QR), since all subsets of
descendants are evaluated.
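The recursion of Eq. (8) can be sketched as an exhaustive search over subsets of children. This Python sketch is only an illustration under stated assumptions: the class `Node`, its fields, and the numbers are hypothetical; `ts` and `tr` stand for Ts(covmin(QL)) and Tr(covmin(QL)), `ratio` for j/r, and `sjs` for SJSL.A. No memoization is needed here because each subtree is reached through a single parent.

```python
from dataclasses import dataclass, field
from itertools import combinations

@dataclass
class Node:
    name: str
    t: float                                        # T(c), message size sent by c
    qchildren: list = field(default_factory=list)   # children overlapping Q_R

def cost_gathering(node):
    """Gathering cost of the subtree below `node` when no semijoin is used."""
    return sum(cost_gathering(c) + c.t for c in node.qchildren)

def cost_partial(node, ts, tr, ratio, sjs):
    """Eq. (8) by enumeration; returns (minimum cost, names of chosen children)."""
    kids = node.qchildren
    best_cost, best_set = None, ()
    for k in range(len(kids) + 1):
        for chosen in combinations(range(len(kids)), k):
            # The join column is broadcast once, and only if C is non-empty.
            cost = ts * ratio if chosen else 0.0
            for i, c in enumerate(kids):
                if i in chosen:   # c receives the column and returns R_c ⋉ L
                    cost += tr * ratio + c.t * sjs
                    cost += cost_partial(c, ts, tr, ratio, sjs)[0]
                else:             # c simply forwards its gathered readings
                    cost += cost_gathering(c) + c.t
            if best_cost is None or cost < best_cost:
                best_cost, best_set = cost, tuple(kids[i].name for i in chosen)
    return best_cost, best_set

# Hypothetical subtree below cov_min(Q_R), loosely following the Fig. 7 scenario.
root = Node("covmin_QR", 82.0,
            [Node("s1", 40.0), Node("s2", 40.0), Node("s3", 2.0)])

best_cost, chosen = cost_partial(root, ts=16.0, tr=8.0, ratio=0.25, sjs=0.25)
print(best_cost, chosen)
```

For these numbers the minimum is 30 with semijoins at s1 and s2, while s3 only forwards its readings because its data is smaller than the cost of receiving the join column.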
In order to reduce the time complexity, we devise a greedy method. Fig. 8 shows a greedy algorithm for partitionJoin. This
algorithm traverses the subtree of covmin(QR) in a breadth-first traversal manner.
The procedure compute_partial() performs the following:
• Candidate children which may generate the semijoin results are computed (Lines 7–14). If a child node c ∈ qchild() generates the semijoin result, c receives ΠA(L) and returns the result of Rc ⋉ L. Thus, if the transmission cost T(c) is greater than the sum of the receiving cost of ΠA(L) and the transmission cost of the semijoin result (Line 8), it may be beneficial to perform the semijoin at c. If c is a candidate, its cost is accumulated in costC (Line 10).
• Children for generating the semijoin results are computed (Lines 15–21). In order to perform semijoins at a subset C, a parent node (i.e., aNode) sends ΠA(L). Thus, when the benefit (= Σc∈C T(c) − costC) is greater than Ts(covmin(QL)) · j/r, doing semijoins at C is efficient (Line 15). If it is beneficial to perform the semijoins at C, we can further consider the children of C (Line 16). The time complexity of our algorithm is O(n) since the algorithm traverses child nodes in a breadth-first manner.
• Costpartial(covmin(QL), covmin(QR)) based on the heuristic is computed as Costpartial (Lines 3–22). Costpartial is initialized to 0 (Line 3). If it is beneficial to perform semijoins at C (Line 15), the sending cost from the parent to C (= Ts(covmin(QL)) · j/r) and costC are added to Costpartial (Line 17). If it is not beneficial to perform a semijoin at a child c, Costgathering(c) and T(c) are added to Costpartial (Lines 12, 20).
¹ In [18], the authors state that the receiving cost is generally 60% less than the sending cost.
Fig. 8. A greedy algorithm for partition join.
Thus, the query optimizer can choose the best plan between partitionJoin and sideJoin by the comparison of Costpartial and
Costgathering (covmin(QR)).
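The greedy procedure described above can be sketched as follows. This is a simplified illustration and not a transcription of Fig. 8: the class TreeNode, the function greedy_partial_cost, and the parameters send_pa and recv_pa (the sending and receiving costs of PA(L)) are hypothetical names, and the per-node values T(c) and Costgathering(c) are assumed to be given.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality, so nodes can be stored in a set
class TreeNode:
    t: float          # T(c): cost for c to send its gathered tuples to its parent
    gathering: float  # Cost_gathering(c): cost to collect the subtree of c at c
    children: list = field(default_factory=list)


def greedy_partial_cost(root, send_pa, recv_pa, sjs):
    """Greedy estimate of Cost_partial, visiting each node once (breadth first).

    send_pa / recv_pa: assumed costs of sending / receiving PA(L).
    sjs: semijoin selectivity, so t * sjs approximates the semijoin result cost.
    """
    total = 0.0
    queue = deque([root])
    while queue:
        node = queue.popleft()
        # A child is a candidate if shipping its raw data costs more than
        # receiving PA(L) and shipping only the semijoin result.
        candidates = [c for c in node.children if c.t > recv_pa + c.t * sjs]
        cost_c = sum(recv_pa + c.t * sjs for c in candidates)
        benefit = sum(c.t for c in candidates) - cost_c
        # Semijoins at the candidates pay off only if the benefit exceeds
        # the parent's cost of sending PA(L).
        chosen = set(candidates) if benefit > send_pa else set()
        if chosen:
            total += send_pa + cost_c
        for child in node.children:
            if child in chosen:
                queue.append(child)  # consider pushing PA(L) further down
            else:
                total += child.gathering + child.t
    return total


root = TreeNode(t=0.0, gathering=110.0,
                children=[TreeNode(t=100.0, gathering=0.0),
                          TreeNode(t=10.0, gathering=0.0)])
print(greedy_partial_cost(root, send_pa=8.0, recv_pa=4.0, sjs=0.5))
```

The returned value plays the role of Costpartial; comparing it with Costgathering(covmin(QR)) gives the choice between partitionJoin and sideJoin.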
4.2. Synopsis join
In partitionJoin presented in Section 4.1, the transmission cost for PA(L) is a key factor to apply partitionJoin. Thus, if we
can reduce this cost, a more efficient join plan can be obtained.
A synopsis is a summary of a relation. By using a synopsis, we may reduce this cost. We call this method synopsisJoin. In
our work, with respect to the join condition, different synopses are used.
As mentioned in [22], when a join is not an equi-join, sending the min (or max) value is sufficient to perform a join. For
example, if a join condition is L.A < R.A, only the minimum value of L.A is required to perform R ⋉ L since all tuples in R with
R.A values greater than the minimum value of L.A participate in a join operation.
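The equivalence claimed above can be checked with a small sketch. The function names and the sample values are hypothetical, and L is assumed to be non-empty.

```python
def semijoin_direct(r_values, l_values):
    """R semijoin L under the condition L.A < R.A, using all of L.A."""
    return [r for r in r_values if any(l < r for l in l_values)]


def semijoin_with_min(r_values, l_min):
    """Same semijoin, but only the minimum of L.A is shipped as the synopsis."""
    return [r for r in r_values if l_min < r]


l_attr = [5, 9, 7]
r_attr = [3, 5, 6, 10]
print(semijoin_direct(r_attr, l_attr) == semijoin_with_min(r_attr, min(l_attr)))
```

For a condition of the form L.A > R.A, the maximum value of L.A would play the same role.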
For an equi-join, we use the bloom filter [3]. The bloom filter consists of an array of m bits and a set of k independent hash
functions each of which maps an element to an integer in the range of [1, m]. An element in a set is represented in the bloom
filter by setting all positions, computed by hash functions, of the bit array to 1. We can check a membership using the bloom
filter. Suppose that the bloom filter is constructed using a set of attribute values of L.A. If at least one of the positions related
to an attribute value of the array is 0, the attribute value is not a member. Thus, by using the membership checking feature of
the bloom filter, an equi-join is performed without the original relation. There can be some false positives associated with
using the bloom filter; however, they are not significant.
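A minimal sketch of such a bloom filter is given below. The choice of salted SHA-256 digests as the k hash functions, the value k = 3, and 0-based bit positions are assumptions made for illustration; the text only fixes the bit array of size m and the k independent hash functions.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, k):
        self.m = m      # number of bits in the array
        self.k = k      # number of hash functions
        self.bits = 0   # bit array stored as an integer

    def _positions(self, value):
        # k salted digests stand in for k independent hash functions.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, value):
        for p in self._positions(value):
            self.bits |= 1 << p

    def might_contain(self, value):
        # A single 0 bit proves non-membership; all 1s may be a false positive.
        return all((self.bits >> p) & 1 for p in self._positions(value))

    def union(self, other):
        # Bitwise OR of two filters represents the union of their sets.
        merged = BloomFilter(self.m, self.k)
        merged.bits = self.bits | other.bits
        return merged


bf = BloomFilter(m=240, k=3)   # 240 bits = 30 bytes
for reading in (10, 20, 30):
    bf.add(reading)
print(bf.might_contain(20))
```

A node holding a tuple of R would call might_contain on its R.A value and forward the tuple only if the call returns true.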
synopsisJoin is divided into two specific plans: synopsisJoin and a variant called fullsynopsisJoin.
The synopsisJoin plan is similar to partitionJoin except that a synopsis of PA(L) is sent from covmin(QL ∪ QR) to covmin(QR).
In the previous plans, the size of data to be transmitted is computed. But, since a fixed-size synopsis is transmitted, a
different model is required.
In synopsisJoin, covmin(QL ∪ QR) may not send a synopsis of PA(L) if L does not arrive from covmin(QL). Let m be the size of a
synopsis and P(s) be the probability that a node s transmits data.
Then, P(s) = 1 − Prob(node s does not transmit data). The transmission probability P() is determined by a query region
and the selectivity of a selection predicate. In Fig. 4, for s not to transmit data, the sensor reading of the node s must
not satisfy the selection condition PQ and no child of s may transmit data to s. Therefore, P(s) can be expressed by
Eq. (9).
$$P(s) = 1 - (1 - sel(P_Q)) \prod_{c \in child(s)} (1 - P(c)) \tag{9}$$
So, the average transmission cost for a synopsis of PA(L), TsysL, is derived as follows:
$$T_{sysL} = m \cdot P(cov_{min}(Q_L)) \tag{10}$$
Therefore, DR · TsysL is the synopsis transmission cost from covmin(QL ∪ QR) to covmin(QR).
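Eqs. (9) and (10) can be evaluated bottom-up over the routing tree, as in the following sketch. The class SensorNode and the per-node field sel are illustrative assumptions; a node outside the query region would simply carry sel = 0.

```python
from dataclasses import dataclass, field


@dataclass
class SensorNode:
    sel: float  # sel(P_Q) at this node; 0.0 if the node is outside the region
    children: list = field(default_factory=list)


def transmit_probability(node):
    """P(s) of Eq. (9): s is silent only if its own reading fails the
    selection predicate and every child is silent as well."""
    all_children_silent = 1.0
    for child in node.children:
        all_children_silent *= 1.0 - transmit_probability(child)
    return 1.0 - (1.0 - node.sel) * all_children_silent


def synopsis_cost(cover_node, m):
    """T_sysL of Eq. (10): synopsis size times P(cov_min(Q_L))."""
    return m * transmit_probability(cover_node)


cover = SensorNode(0.5, [SensorNode(0.5), SensorNode(0.5)])
print(transmit_probability(cover), synopsis_cost(cover, m=30))
```

With a selectivity of 0.5 at the cover node and at its two leaf children, P is 1 − 0.5 · 0.5 · 0.5 = 0.875, so a 30-byte synopsis has an expected cost of 26.25 bytes per hop.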
Then, similar to partitionJoin, a synopsis of PA(L) is propagated to descendants of covmin(QR) to perform semijoins. This
cost is derived as Eq. (11). TsysL,s and TsysL,r are the sending and receiving costs of a synopsis of PA(L), respectively. Additionally,
in Fig. 8, Tr and Ts at Lines 8, 10, 15, and 17 are replaced with TsysL,r and TsysL,s, respectively, for the greedy algorithm of
synopsisJoin.
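$$
\begin{aligned}
Cost_{partial,synopsis}(cov_{min}(Q_L), cov_{min}(Q_R)) = \min_{C \subseteq qchild(cov_{min}(Q_R))} \Big\{ & T_{sysL,s} + \sum_{c \in C} T_{sysL,r} + \sum_{c \in C} Cost_{partial,synopsis}(cov_{min}(Q_L), c) \\
& + \sum_{c \in C} T(c) \cdot SJS_{L.A} + \sum_{c' \in qchild(cov_{min}(Q_R)) \setminus C} \big( Cost_{gathering}(c') + T(c') \big) \Big\}, \\
& \text{where } T_{sysL,s} = 0 \text{ if } C = \emptyset
\end{aligned}
\tag{11}
$$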
In synopsisJoin, L is sent to covmin(QL ∪ QR). In this case, since some tuples in L will not participate in the join result, it wastes
the energy of sensors.
To solve this problem, we devise the fullsynopsisJoin plan. In fullsynopsisJoin, covmin(QL) sends a synopsis of PA(L) instead of
L itself to covmin(QL ∪ QR), and then covmin(QL ∪ QR) sends the synopsis to covmin(QR).
fullsynopsisJoin consists of four steps:
1. Generating a synopsis of PA(L) on QL.
2. Performing the synopsisJoin plan on QR.
3. Performing the synopsisJoin plan on QL with the synopsis of PA(R ⋉ L) using the second step's result.
4. Sending the semijoin results to the base station.
At the first step, when the minimum value of attribute L.A is used as a synopsis, each node in QL sends the minimum value
among data from children and its reading. When the bloom filter is used, a node in QL generates the bloom filter by ORing
child nodes' bloom filters and inserting the hashed value of its reading since the bloom filter BF_U for a set U is equal to ∨_i BF_{U_i},
where ∪_i U_i = U. Thus, the cost of the first step Costcon(covmin(QL)) is derived as follows:
$$Cost_{con}(cov_{min}(Q_L)) = \sum_{c \in child(cov_{min}(Q_L))} \big( Cost_{con}(c) + P(c) \cdot m \big) \tag{12}$$
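The recursion in Eq. (12) charges every tree edge below covmin(QL) with the synopsis size m weighted by the probability that the child transmits. A minimal sketch, assuming the transmission probability P(c) has already been computed and stored in a hypothetical field p:

```python
from dataclasses import dataclass, field


@dataclass
class RegionNode:
    p: float  # P(c): probability that this node transmits (assumed precomputed)
    children: list = field(default_factory=list)


def construction_cost(node, m):
    """Cost_con(node) of Eq. (12) for a synopsis of size m."""
    return sum(construction_cost(c, m) + c.p * m for c in node.children)


cover = RegionNode(0.9, [RegionNode(0.75, [RegionNode(0.5)]), RegionNode(0.5)])
print(construction_cost(cover, m=30))
```

A leaf contributes no cost of its own; its transmission is accounted for at its parent through the term P(c) · m.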
At the second step, the synopsis is sent to covmin(QR) through covmin(QL ∪ QR) and synopsisJoin is applied on QR. Thus, the cost
for the second step is as follows:
$$D_L \cdot T_{sysL} + D_R \cdot T_{sysL} + Cost_{partial,synopsis}(cov_{min}(Q_L), cov_{min}(Q_R)) \tag{13}$$
As a result of the second step, the result of R ⋉ L is collected at covmin(QR). At the third step, this result is sent to
covmin(QL ∪ QR). Using this result, a synopsis of PA(R ⋉ L) is made at covmin(QL ∪ QR) and sent to covmin(QL). Then, synopsisJoin
is applied on QL to obtain L ⋉ R. Thus, the cost for the third step is derived as follows:
$$D_R \cdot T(cov_{min}(Q_R)) \cdot SJS_{L.A} + D_L \cdot T_{sysR} + Cost_{partial,synopsis}(cov_{min}(Q_R), cov_{min}(Q_L)) \tag{14}$$
In the second term of Eq. (14), we use TsysR as the transmission cost for the synopsis of PA(R ⋉ L). When a set of tuples in R
appears with the probability P(covmin(QR)), if the semijoin selectivity is not zero, a synopsis of PA(R ⋉ L) is not empty. Thus,
TsysR is sufficient to use as the transmission cost for the synopsis of PA(R ⋉ L).
Finally, covmin(QL ∪ QR) receives the result of L ⋉ R from covmin(QL) and sends the union of R ⋉ L and L ⋉ R to the base station. This cost is DL · T(covmin(QL)) · SJSR.A + DC · (T(covmin(QL)) · SJSR.A + T(covmin(QR)) · SJSL.A).
Here, in order to maintain consistency with the other plans, we explain that covmin(QL ∪ QR) keeps the result of R ⋉ L to
send the union of R ⋉ L and L ⋉ R. However, the result of R ⋉ L is actually generated at the second step and that of L ⋉ R is
generated at the third step. Thus, it is possible to send each semijoin result to the base station at the end of each step without
keeping the result of R ⋉ L at covmin(QL ∪ QR).
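The four step costs of fullsynopsisJoin can be collected in one place, as in the sketch below. It assumes that the overall cost is the sum of the four steps, which the text implies but does not state as a formula; all parameter names are hypothetical, and the Costcon and Costpartial,synopsis terms are taken as already computed inputs.

```python
def fullsynopsis_step_costs(d_l, d_r, d_c, t_cov_l, t_cov_r,
                            t_sys_l, t_sys_r, sjs_la, sjs_ra,
                            cost_con_l, cost_partial_r, cost_partial_l):
    """Return the costs of steps 1-4 of fullsynopsisJoin.

    t_cov_l / t_cov_r: T(cov_min(Q_L)) / T(cov_min(Q_R)).
    cost_con_l: Cost_con(cov_min(Q_L)) from Eq. (12).
    cost_partial_r / cost_partial_l: Cost_partial,synopsis on Q_R / on Q_L.
    """
    step1 = cost_con_l
    step2 = d_l * t_sys_l + d_r * t_sys_l + cost_partial_r            # Eq. (13)
    step3 = d_r * t_cov_r * sjs_la + d_l * t_sys_r + cost_partial_l   # Eq. (14)
    step4 = (d_l * t_cov_l * sjs_ra
             + d_c * (t_cov_l * sjs_ra + t_cov_r * sjs_la))
    return step1, step2, step3, step4


steps = fullsynopsis_step_costs(
    d_l=2, d_r=2, d_c=1, t_cov_l=100, t_cov_r=80, t_sys_l=30, t_sys_r=30,
    sjs_la=0.5, sjs_ra=0.5, cost_con_l=50, cost_partial_r=200,
    cost_partial_l=220)
print(steps, sum(steps))
```

Keeping the steps separate makes it easy to see which term dominates when a selectivity or a hop distance changes.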
4.3. Memory requirement of in-network join plans
Generally, sensor nodes have limited memory space. The memory requirement of each join plan should be considered. In
this section, we only briefly present the memory requirement of each join plan due to space limitations.
In coverJoin, covmin(QL ∪ QR) keeps the whole of L and R to perform R ⋉ L and L ⋉ R.
In sideJoin, covmin(QR) keeps R to perform R ⋉ L and covmin(QL ∪ QR) keeps L to perform L ⋉ R.
Compared to sideJoin, partitionJoin has an advantage in memory space. In partitionJoin, a sensor c in QR needs to keep
Rc ⊆ R and PA(L) to perform Rc ⋉ L. Thus, when R is partitioned into a set of Rc's, if the sum of the size of PA(L) and the maximum size of Rc is less than the memory limitation, partitionJoin can be performed.
synopsisJoin reduces the memory requirements of sensors in QR compared to partitionJoin since a synopsis of PA(L) is used
instead of PA(L).
However, in sideJoin, partitionJoin, and synopsisJoin, L should be kept in covmin(QL ∪ QR) to perform L ⋉ R. Thus, if the memory
space of a sensor (i.e., covmin(QL ∪ QR)) is smaller than the size of L, sideJoin, partitionJoin and synopsisJoin cannot be applied.
In contrast to the other plans, in fullsynopsisJoin, whole relations L and R are not required at either of the two query regions
since the synopsisJoin plan is applied on both QL and QR. Thus, the memory requirement of fullsynopsisJoin is the smallest among
the devised in-network join plans.
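The two memory conditions stated above for partitionJoin can be written as a simple check. The tuple and join attribute sizes are those used later in the experiments (44 and 8 bytes); the function name, its arguments, and the assumption that PA(L) holds one join attribute value per tuple of L are illustrative only.

```python
TUPLE_BYTES = 44      # tuple size used in the experiments
JOIN_ATTR_BYTES = 8   # join attribute size used in the experiments


def partition_join_memory_check(num_l_tuples, partition_tuple_counts,
                                memory_bytes):
    """Return (sensor_ok, cover_ok) for partitionJoin.

    sensor_ok: PA(L) plus the largest partition R_c fits at a sensor in Q_R.
    cover_ok:  the whole of L fits at cov_min(Q_L U Q_R).
    """
    pa_l_bytes = num_l_tuples * JOIN_ATTR_BYTES
    largest_rc_bytes = max(partition_tuple_counts) * TUPLE_BYTES
    sensor_ok = pa_l_bytes + largest_rc_bytes < memory_bytes
    cover_ok = num_l_tuples * TUPLE_BYTES <= memory_bytes
    return sensor_ok, cover_ok


print(partition_join_memory_check(100, [10, 20], memory_bytes=4000))
```

In this example the sensors in QR have enough memory (800 + 880 bytes), but the cover node cannot hold L (4400 bytes), which is the situation in which only fullsynopsisJoin remains applicable.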
5. Experiments
In this section, we demonstrate the effectiveness of INJECT and show the efficiency of our proposed in-network join plans.
To this end, we implement diverse join plans: baseJoin, coverJoin, sideJoin, partitionJoin, synopsisJoin and fullsynopsisJoin. We empirically compare the performance of the devised join plans and show that INJECT chooses
the optimal or near optimal plan over diverse environments.
5.1. Experimental environment
In this section, we present the features of the experimental data set and the parameters to configure diverse
environments.
The default network configuration of experiments is a 10 × 10 grid, and sensor nodes are placed at each grid point. The base
station is located at the upper left corner. The routing tree is constructed using the FHF (First-Heard-From) network configuration algorithm [14]. The maximum hop distance of the network tree is eight.
As shown in our basic cost model in Eq. (2), the transmission cost is affected by the selection selectivity and the size of a
query region. In addition, the join selectivity affects the communication cost. Thus, to simulate diverse conditions of networks, we use some parameters. The default parameter settings used in our experiments are summarized in Table 2.
We set the size of a tuple to 44 bytes and the size of join attribute to 8 bytes. For synopsisJoin and fullsynopsisJoin, the
bloom filter is used. We set the size of a bloom filter to 30 bytes.
In our experiments, we run the join query for an interval of 1000 epochs. We show the accumulated costs of estimation
using the proposed cost models and those of actual execution.
Generally, the energy consumption of data transmission differs according to the sensor's type, while our cost model is
based on the size of data transmission. Thus, we use the amount of transmitted data as the performance metric.²
5.2. Experimental results
In each experiment, we vary one of the parameters and show its effect. With a few exceptions, the estimated cost accurately reflects the relative rank for each plan.
In previous sections, we only presented the case in which PA(L) (or a synopsis of PA(L)) is sent to covmin(QR). However, in all plans except
baseJoin and coverJoin, the join column of either relation can be sent to the opposite side. For example, PA(R) can be sent to covmin(QL)
in sideJoin. Thus, in our experimental results, we attach the prefixes L and R to each plan to represent these cases. The prefix L
denotes that a semijoin is performed at the QL side (i.e., PA(R) is sent to covmin(QL)). The prefix R denotes the opposite case.
5.2.1. Join selectivity
To show the effect of the join selectivity, we set SJSL.A for R ⋉ L to 0.5 and vary the semijoin selectivity (SJSR.A) for L ⋉ R
from 0.1 to 1.0.
Fig. 9 shows the results of the estimated costs and actual costs of diverse plans. As shown in Fig. 9, the pattern of the estimated
costs is quite similar to that of the actual costs.
INJECT estimates that LfullsynopsisJoin is superior to other plans when SJSR.A is from 0.1 to 0.7 and RsynopsisJoin is the best
plan when SJSR.A is from 0.8 to 1. In the result of actual costs shown in Fig. 9(b), LfullsynopsisJoin shows the best performance
when SJSR.A is from 0.1 to 0.6. And RsynopsisJoin is superior when SJSR.A is from 0.8 to 1. When SJSR.A is 0.7, the RfullsynopsisJoin
shows the best performance. But the performance of LfullsynopsisJoin is very close to that of RfullsynopsisJoin. This result confirms that INJECT chooses the optimal or near optimal plan among diverse in-network join plans.
² Since the transmission size of each node can be estimated using our cost model, other measures such as the number of data transmissions and the energy
consumption can be derived easily. However, we omit them due to space limitations.
Table 2
Parameters.
Parameter    Default value    Comments
sel(PQ)      0.5              Selection selectivity for a region Q
SJS          0.5              Semijoin selectivity
DC           1                See Fig. 5
DL           2                See Fig. 5
DR           2                See Fig. 5
|QL|         10               # of nodes in QL
|QR|         10               # of nodes in QR
Generally, the transmission costs of basic join plans (baseJoin, coverJoin, sideJoin) are higher than the enhanced join plans
(partitionJoin, synopsisJoin, fullsynopsisJoin). Among the enhanced join plans, the costs of synopsisJoin and fullsynopsisJoin are
much less than those of the other plans in almost all cases.
Since, in baseJoin, all tuples of L and R satisfying selection predicates are sent to the base station, the semijoin selectivity
does not affect the performance of baseJoin.
In coverJoin, L and R are sent to covmin(QL ∪ QR) and the results of L ⋉ R and R ⋉ L are sent to the base station. Thus, since
the size of L ⋉ R increases as SJSR.A increases, the cost of coverJoin increases.
Now we consider LsideJoin and RsideJoin plans. As SJSR.A increases, the number of L's tuples participating in a semijoin result increases. In other words, the size of L ⋉ R is smaller than that of R ⋉ L when SJSR.A is smaller than SJSL.A. Therefore, when
SJSR.A is smaller than 0.5, LsideJoin is better than RsideJoin since many tuples of L are filtered out by the semijoin L ⋉ R at
covmin(QL). But, when SJSR.A is greater than 0.5, RsideJoin is better than LsideJoin. Similar arguments hold for LpartitionJoin
and RpartitionJoin as well as LsynopsisJoin and RsynopsisJoin.
In contrast to the other join plans, LfullsynopsisJoin and RfullsynopsisJoin show similar performance since semijoins
using synopses are performed at both sides. The difference between LfullsynopsisJoin and RfullsynopsisJoin is the side at which the initial synopsis
is generated. This cost is mainly affected by the selection selectivity sel(PQ) for a query region Q.
As mentioned in Sections 4.1 and 4.2, in partitionJoin and synopsisJoin plans, a query region Q is partitioned into subregions Qc and a join column or a synopsis is distributed to subregions.
However, as mentioned above, as SJSR.A increases, the number of tuples of L which will contribute to a join result increases. Thus, in each subregion, a small number of tuples will be filtered out. But the cost to distribute PA(R) or a synopsis
of PA(R) is required. Thus, when SJSR.A is greater than 0.7, our partition algorithm computes that no partitioning of QL is beneficial. In these cases, LpartitionJoin acts as LsideJoin. Compared to LpartitionJoin, LsynopsisJoin is slightly better since a synopsis of PA(R) instead of PA(R) is sent to covmin(QL).
In contrast, QR is partitioned into several subregions in RpartitionJoin and RsynopsisJoin since SJSL.A is fixed at 0.5. Thus,
RpartitionJoin and RsynopsisJoin show better performance than RsideJoin.
LfullsynopsisJoin and RfullsynopsisJoin show the best performance when SJSR.A is smaller than 0.7 since the partition strategy is applied to both sides and many tuples are filtered out. However, when SJSR.A is greater than 0.7, LfullsynopsisJoin and
RfullsynopsisJoin are worse than RpartitionJoin and RsynopsisJoin since L/RfullsynopsisJoin act as LsideJoin on QL although
L/RfullsynopsisJoin behave on QR like RsynopsisJoin.
One interesting point is that the costs of LsideJoin, LpartitionJoin, and LsynopsisJoin are greater than that of coverJoin
when SJSR.A is greater than 0.7. In particular, when SJSR.A is 1.0, the costs of LsideJoin and LpartitionJoin approach that of
baseJoin. This result indicates that using a fixed in-network join plan blindly wastes as much energy as a naive join plan
(i.e., baseJoin), and therefore, cost based optimization is required.
Fig. 9. Join selectivity results (SJSL.A = 0.5): (a) estimated cost; (b) actual cost. Both panels plot the transmission cost (×1000) against SJSR.A for baseJoin, coverJoin, LsideJoin, RsideJoin, LpartitionJoin, RpartitionJoin, LsynopsisJoin, RsynopsisJoin, LfullsynopsisJoin, and RfullsynopsisJoin.
Fig. 10. Selection selectivity results (sel(PQL) = 0.5): (a) estimated cost; (b) actual cost. Both panels plot the transmission cost (×1000) against sel(PQR) for baseJoin, coverJoin, LsideJoin, RsideJoin, LpartitionJoin, RpartitionJoin, LsynopsisJoin, RsynopsisJoin, LfullsynopsisJoin, and RfullsynopsisJoin.
Compared to the actual costs, the average of the relative error rates (= |(estimated cost − actual cost)/actual cost|) is about 5% and
the maximum of the relative error rates is 18%. Therefore, the proposed cost models accurately estimate the transmission
costs.
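The relative error rate defined above can be computed per data point and then summarized, as in this short sketch. The cost values in the example are made up for illustration and are not taken from the experiments.

```python
def relative_errors(estimated, actual):
    """|(estimated - actual) / actual| for each pair of costs."""
    return [abs(e - a) / a for e, a in zip(estimated, actual)]


# Hypothetical estimated and actual costs, not measured values.
errors = relative_errors([105.0, 90.0], [100.0, 100.0])
print(sum(errors) / len(errors), max(errors))
```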
5.2.2. Selection selectivity
This experiment tests the effect of the selection selectivity. We set the selection selectivity of QL (= sel(PQL)) to 0.5 and vary
the selection selectivity of QR (= sel(PQR)) from 0.1 to 1.0. Other parameters are all set to their default values. The results are
shown in Fig. 10. The patterns of the estimated costs and the actual costs are also similar. The average of relative error rates is
about 7%.
In this experiment, INJECT chooses LpartitionJoin as the optimal plan when sel(PQR) is small (i.e., 0.1–0.3). As shown in
Fig. 10(b), LfullsynopsisJoin shows the best performance when sel(PQR) is small. However, the performance gap between LfullsynopsisJoin and LpartitionJoin is quite small. In addition, as expected by INJECT, RfullsynopsisJoin shows the best performance
in the other cases.
As the selection selectivity increases, the transmission cost also increases. Thus, in contrast to the result for baseJoin in
Fig. 9, the cost of baseJoin increases. The cost of coverJoin also increases. But, since R ⋉ L and L ⋉ R instead of R and L are sent
from covmin(QL ∪ QR) to the base station in coverJoin, coverJoin is better than baseJoin.
As sel(PQR) increases, the number of tuples of R satisfying the selection predicate increases. Thus, it is beneficial that a
semijoin is performed at the QR side. Therefore, as shown in Fig. 10, when sel(PQR) is greater than sel(PQL) (= 0.5), a plan in which
a semijoin is performed at the QR side (e.g., RsideJoin) has a smaller transmission cost than its counterpart (e.g., LsideJoin).
As mentioned in Section 5.2.1, the behaviors of LfullsynopsisJoin and RfullsynopsisJoin are similar except for the side at which the initial synopsis is constructed. As presented in Eqs. (9) and (12), the synopsis construction cost is affected by the selection
selectivity. In LfullsynopsisJoin, the initial synopsis is constructed at the QR side. As sel(PQR) increases, the synopsis construction
cost on QR increases. Thus, the performance of LfullsynopsisJoin becomes worse than that of RfullsynopsisJoin.
Recall that, as mentioned in Section 5.2.1, partitioning a query region is beneficial when the join selectivity is not high. In
this experiment, we set SJSR.A and SJSL.A to 0.5. Therefore, as sel(PQR) increases, the performance gap between RsideJoin, which
gathers tuples satisfying a selection predicate at covmin(QR), and the join plans applying the partition strategy on QR (i.e., RpartitionJoin, RsynopsisJoin, L/RfullsynopsisJoin) increases.
5.2.3. Cover node depth
In this experiment, we vary the hop distance and show its effect. We set DC and DL to 1 and 2, respectively. We change DR
from a smaller value to a larger value than DL. The results are shown in Fig. 11.
As shown in Fig. 11, since the selection selectivity and the join selectivity are set to the same values at both sides, when DR and DL are
the same (= 2), the performance of a plan performing a semijoin at QR and that of its counterpart are the same.
Although the size of transmitted data of covmin(QL) and that of covmin(QR) are the same, nodes having a larger hop distance
to covmin(QL ∪ QR) have a larger transmission cost. In the case of LsynopsisJoin and RsynopsisJoin, the smaller cost plan
changes from RsynopsisJoin to LsynopsisJoin as DR increases. Other plans also show similar patterns. The average relative error
rate of all cases is about 6%.
5.2.4. Memory requirement
Sensor nodes generally have limited memory space. Therefore, the memory consumption of each join plan should be considered. In this section, we show the minimum and maximum memory requirements of each in-network join plan with default parameters. In Table 3, the minimum size and maximum size (in bytes) of accumulated memory requirement of a node
as well as the corresponding nodes in parentheses are presented. Lci is a child of covmin(QL).
Fig. 11. Cover node depth results (DL = 2): (a) estimated cost; (b) actual cost. Both panels plot the transmission cost (×1000) against DR for baseJoin, coverJoin, LsideJoin, RsideJoin, LpartitionJoin, RpartitionJoin, LsynopsisJoin, RsynopsisJoin, LfullsynopsisJoin, and RfullsynopsisJoin.
Table 3
Memory requirement.
Join plan           MIN data size                  MAX data size
coverJoin           443,344 (covmin(QL ∪ QR))      (single entry)
LsideJoin           220,880 (covmin(QL ∪ QR))      222,464 (covmin(QL))
RsideJoin           220,880 (covmin(QR))           222,464 (covmin(QL ∪ QR))
LpartitionJoin      61,764 (Lc2)                   220,880 (covmin(QL ∪ QR))
RpartitionJoin      62,536 (Rc2)                   222,464 (covmin(QL ∪ QR))
LsynopsisJoin       21,018 (Lc2)                   220,880 (covmin(QL ∪ QR))
RsynopsisJoin       22,118 (Rc2)                   222,464 (covmin(QL ∪ QR))
fullsynopsisJoin    21,018 (Lc5)                   44,998 (Lc6)
Because, in coverJoin, covmin(QL ∪ QR) keeps the whole of L and R to perform R ⋉ L and L ⋉ R, coverJoin requires larger memory
space. LsideJoin and RsideJoin reduce the memory requirement because only one of the relations L and R needs to be kept at each cover
node.
partitionJoin requires smaller memory space than sideJoin because a relation is partitioned into subregions. In the case of
synopsisJoin, the required memory space is smaller than that of partitionJoin because a synopsis is used instead of the projection
result. However, in partitionJoin and sideJoin, covmin(QL ∪ QR) keeps the other relation.
fullsynopsisJoin reduces the required memory space much more because the partitioning strategy is applied to both query regions. Therefore, even when the memory space of a sensor is smaller than a relation, fullsynopsisJoin can be applied.
6. Related work
Joins are common in applications for target tracking, event detecting, correlation analysis, and so on. In the database literature, several sensor data management systems such as Cougar [7] and TinyDB [14] have been introduced. However, these
systems do not support joins efficiently. Basically, in these systems, joins are performed at the base station like the baseJoin
plan.
Recently, some works for the in-network join processing have been conducted.
Abadi et al. propose REED [1], which is a distributed join algorithm for event detection. In REED, some conditions for event
detection are specified as a stable relation. REED supports joins between sensor data and a static relation built outside the
networks. Thus, they do not consider joins of sensor readings in distinct regions.
Bonfils and Bonnet [4] suggest an adaptive algorithm for finding the optimal join location. Similar work is performed in [19]
for environments in which data is transmitted through a hierarchy of network nodes with progressively increasing computing
power and network bandwidth. This work considers that the join is performed at only one node.
Yang et al. [22] propose the two-phase self join (TPSJ) approach. TPSJ is a kind of semantic optimization technique. In
TPSJ, the base station gathers some sensor readings which act as a stable relation like REED [1]. Then, the gathered sensor
readings are distributed through sensor networks and each sensor performs join operation. In TPSJ, the distribution cost of
gathered readings is reduced if currently gathered readings are contained in previously distributed readings. But in order to
check the containment, the sensor reading composing the stable table should be transmitted to the base station.
Some in-network join methods between two regions in sensor networks were presented in [6,17,20]. The work in [17] considers the
optimal join location using a cost model. In this work, the optimal join location is near the weighted centroid of three
points: the two center points of the two query regions and the base station. But the cost models used in these methods are not
accurate since they do not consider join selectivities and they assume that all sensor readings in a query region are collected
at the center of the region.
A similar work is done in [20], where the weighted Fermat point is obtained as the optimal join location.
Coman et al. [6] presented three join plans which are similar to our three basic join plans. External join in [6] is identical
to our baseJoin. Mediated join [6] is similar to our coverJoin since the two relations are joined at one node. As in [20], the weighted Fermat point is considered as the join location in mediated join. But, since our work is based on tree routing, the join location of
coverJoin is covmin(QL ∪ QR). If we align the routing tree to make the weighted Fermat point covmin(QL ∪ QR), coverJoin is quite
similar to mediated join. Local join [6] is similar to our sideJoin since a relation is transmitted to the counterpart. Unlike sideJoin, local join distributes a relation to all nodes in the counterpart region. Note that, in partitionJoin, a generalization of
sideJoin, a relation (actually, a join column) is distributed with respect to the cost model. Thus, with the cost model of partitionJoin, we can obtain a more efficient distribution plan.
All work above except [6] focuses on a general join operator. In order to reduce the communication overhead, we consider
semijoins and synopsis joins. Semijoins and synopsis joins are used widely in distributed database environments [2].
However, these methods cannot be applied directly since sensor networks are hierarchically structured.
In [23], a synopsis join algorithm for in-network join processing is proposed. In this work, a histogram is used as a synopsis. The authors suggest a synopsis join location based on the cost model. In the aspect of using synopsis joins, the work of
[23] is similar to ours. But, the cost model used in [23] is too rough to use in query optimization since the purpose of the cost
model is finding the weighted centroid of two regions’ centers. Furthermore, this technique cannot be used in tree routing
environments since the synopsis join is performed at any node (i.e., a weighted centroid) in a network, like [17].
Stern et al. propose the SENS-Join method [21] in order to avoid shipping tuples that do not join through the network. The
join plan of SENS-Join consists of two phases like [23]. In [21], SENS-Join uses a compact representation of join attribute values based on the quad tree. Some techniques used in [21], such as gathering values of a join attribute rather than whole tuples, can be applied in our work orthogonally. However, in [21], the cost model for SENS-Join is not presented. As mentioned
above, according to various conditions, the optimal query plan is changed. Thus, the cost model is an indispensable component to choose the best query plan. Also, they only show the efficiency of SENS-Join compared with baseJoin.
7. Conclusion
In this work, we suggest the cost based join strategy, called INJECT, for tree routing sensor networks. To do this, we suggest diverse join plans. And, based on the basic cost model to gather sensor readings in a region and to transmit the gathered
data to a certain node via tree routing, we devise cost models for diverse join plans. Since we devise the cost models based on
reasonable assumptions, we have confidence that our cost models can be easily extended to other join plans.
To show the performances of our devised join plans and effectiveness of the cost based query optimization in tree routing
sensor networks, we implement diverse join plans, and conduct an extensive experimental study over diverse conditions.
In our experiments, we show that some of the devised join plans show the best performance in certain cases. However, there is
no plan that is superior in all cases. Our experiments show that our proposed method, INJECT, chooses the optimal or near optimal join plan over diverse cases. Thus, INJECT extends the lifetime of sensor networks.
This work focuses on the efficient in-network processing for a single join query. Thus, as the future work, we will conduct
a study about a technique that efficiently processes multiple join queries in networks, simultaneously, as well as the corresponding cost model.
Acknowledgements
We would like to thank the editor and anonymous reviewers for their helpful comments. This work was supported in part
by the National Research Foundation of Korea grant funded by the Korean government (No. 2010-0016165) and in part by
the National Research Foundation of Korea grant funded by the Korean government (MEST) (No. 2011-0000377).
References
[1] D.J. Abadi, S. Madden, W. Lindner, Reed: robust, efficient filtering and event detection in sensor networks, in: Proceedings of VLDB, 2005, pp. 769–780.
[2] P.A. Bernstein, D.-M.W. Chiu, Using semi-joins to solve relational queries, J. ACM 28 (1) (1981) 25–40.
[3] B.H. Bloom, Space/time trade-offs in hash coding with allowable errors, Commun. ACM 13 (7) (1970).
[4] B.J. Bonfils, P. Bonnet, Adaptive and decentralized operator placement for in-network query processing, in: Proceedings of Information Processing in Sensor Networks, 2003, pp. 47–62.
[5] D. Chu, A. Deshpande, J.M. Hellerstein, W. Hong, Approximate data collection in sensor networks using probabilistic models, in: Proceedings of ICDE, 2006, p. 48.
[6] A. Coman, M.A. Nascimento, J. Sander, On join location in sensor networks, in: Proceedings of International Conference on Mobile Data Management (MDM), 2007, pp. 190–197.
[7] A. Demers, J. Gehrke, R. Rajaraman, N. Trigoni, Y. Yao, The cougar project: a work-in-progress report, SIGMOD Record 32 (3) (2003) 9–18.
[8] A. Deshpande, C. Guestrin, S. Madden, J.M. Hellerstein, W. Hong, Model-driven data acquisition in sensor networks, in: Proceedings of VLDB, 2004, pp. 588–599.
[9] W.R. Heinzelman, A. Chandrakasan, H. Balakrishnan, Energy-efficient communication protocol for wireless microsensor networks, in: Proceedings of Annual Hawaii International Conference on System Sciences (HICSS), 2000.
[10] A.R. Hevner, S.B. Yao, Query processing in distributed database systems, IEEE Trans. Software Eng. 5 (3) (1979) 177–187.
[11] Y. Kotidis, Snapshot queries: towards data-centric sensor networks, in: Proceedings of ICDE, 2005, pp. 131–142.
[12] S. Lindsey, C.S. Raghavendra, K.M. Sivalingam, Data gathering in sensor networks using the energy delay metric, in: Proceedings of IPDPS, 2001, p. 188.
[13] S. Madden, M.J. Franklin, J.M. Hellerstein, W. Hong, Tag: a tiny aggregation service for ad-hoc sensor networks, in: Proceedings of OSDI, 2002.
[14] S. Madden, M.J. Franklin, J.M. Hellerstein, W. Hong, Tinydb: an acquisitional query processing system for sensor networks, ACM Trans. Database Syst. 30 (1) (2005) 122–173.
[15] F. Marcelloni, M. Vecchio, Enabling energy-efficient and lossy-aware data compression in wireless sensor networks by multi-objective evolutionary optimization, Inform. Sci. 180 (10) (2010) 1924–1941.
[16] C. Ok, S. Lee, P. Mitra, S. Kumara, Distributed routing in wireless sensor networks using energy welfare metric, Inform. Sci. 180 (9) (2010) 1656–1670.
[17] A. Pandit, H. Gupta, Communication-efficient implementation of range-joins in sensor networks, in: Proceedings of DASFAA, 2006, pp. 859–869.
[18] A. Silberstein, R. Braynard, Y. Yang, Constraint chaining: on energy-efficient continuous monitoring in sensor networks, in: Proceedings of ACM SIGMOD, 2006, pp. 157–168.
[19] U. Srivastava, K. Munagala, J. Widom, Operator placement for in-network stream query processing, in: Proceedings of ACM PODS, 2005, pp. 250–258.
[20] M. Stern, Optimal locations for join processing in sensor networks, in: Proceedings of International Conference on Mobile Data Management (MDM), 2007, pp. 336–340.
[21] M. Stern, E. Buchmann, K. Böhm, Towards efficient processing of general-purpose joins in sensor networks, in: Proceedings of ICDE, 2009.
[22] X. Yang, H.B. Lim, M.T. Özsu, K.L. Tan, In-network execution of monitoring queries in sensor networks, in: Proceedings of ACM SIGMOD, 2007, pp. 521–532.
[23] H. Yu, E.-P. Lim, J. Zhang, On in-network synopsis join processing for sensor networks, in: Proceedings of the International Conference on Mobile Data Management, 2006, p. 32.