
Information Processing Letters 75 (2000) 35–42
Distributed similarity search algorithm in distributed
heterogeneous multimedia databases
Ju-Hong Lee a,1 , Deok-Hwan Kim a,2 , Seok-Lyong Lee a,3 , Chin-Wan Chung b,∗ ,
Guang-Ho Cha c,4
a Department of Information and Communication Engineering, Korea Advanced Institute of Science and Technology 373-1, Kusong-Dong,
Yusong-Gu, Taejon 305-701, South Korea
b Department of Computer Science, Korea Advanced Institute of Science and Technology 373-1, Kusong-Dong, Yusong-Gu,
Taejon 305-701, South Korea
c IBM Almaden Research Center, San Jose, CA, USA
Received 5 November 1999; received in revised form 30 March 2000
Communicated by K. Iwama
Abstract
The collection fusion problem in multimedia databases is concerned with merging the results retrieved by content-based
retrieval from distributed heterogeneous multimedia databases in order to find the objects most similar to a query object. We
propose distributed similarity search algorithms, two heuristic algorithms and an algorithm using linear regression, to solve
this problem. To our knowledge, these algorithms are the first research results in the area of distributed content-based retrieval
for heterogeneous multimedia databases. © 2000 Elsevier Science B.V. All rights reserved.
Keywords: Distributed similarity search algorithms; Collection fusion; Multimedia databases; Information retrieval
1. Introduction
Along with the current growth of the Internet and
the Web, access to distributed multimedia databases
has emerged as an important research issue. To retrieve
information from numerous data sources, a global
server is needed to integrate various resources and
process queries in a distributed manner [2]. It distributes user queries to local databases, integrates results
to fit user requirements, and also provides the illusion
∗ Corresponding author. Email: [email protected].
1 Email: [email protected].
2 Email: [email protected].
3 Email: [email protected].
4 Email: [email protected].
of a single database. A key problem is how to extract
relevant objects for a query from distributed heterogeneous databases that use different similarity measures.
This issue is called the "collection fusion problem". It
has been studied extensively for text databases [1,
3,5,8], but not for multimedia databases. The problem
arises from the difference of similarity measures in a
heterogeneous environment. The detailed scenario is
as follows: At the global server, a user wants to retrieve
objects similar to a query object from local databases
using a global similarity measure. However, a local
database does not support the global similarity measure but only
its own local similarity measure. When the global similarity
measure is completely different from a local similarity measure,
for instance, when the global similarity measure uses color and
the local similarity measure uses texture, a user cannot get an
appropriate result for a query. Therefore, a global similarity
measure must be correlated with a local similarity measure.
In this paper, we show that there exist cases in which
a linear relationship between two similarity measures holds,
and we propose novel distributed similarity search algorithms
to solve the collection fusion problem for such cases in
distributed heterogeneous multimedia databases. This paper is
organized as follows. Section 2 defines the collection fusion
problem and provides assumptions for the problem. In Section 3,
we propose distributed similarity search algorithms to solve the
collection fusion problem. The experimental results are shown in
Section 4. Concluding remarks are made in Section 5.
2. Collection fusion for distributed multimedia
databases
We discuss several assumptions concerning the
global server and the local databases. The algorithms
proposed in this paper are developed based on these
assumptions.

Assumption 1. The global server selects local databases
supporting similarity measures that are correlated with a
global similarity measure, and then submits the query to them.

Assumption 2. Local databases support incremental similarity
ranking, such as the method using a get-more-objects facility
described in [7].

Assumption 3. For a given query, local databases return the
objects locally most similar to the query object, together with
their local similarity values, as the query result.

The following are the formal definition and objectives of the
collection fusion problem.

Definition. The collection fusion problem in multimedia
databases is how to retrieve and merge the results from
distributed heterogeneous multimedia databases to find the
relevant objects, that is, the k objects most similar to a
query object under a global similarity measure.

Objectives. For distributed similarity search of a given
query Q, let R_Q^i be the set of relevant objects in the ith
local database and I_Q^i be the set of irrelevant objects in
the ith local database. Then R_Q^i ∩ I_Q^i = ∅ and
R_Q^i ∪ I_Q^i = {all objects in the ith local database}.
Let V_Q^i be the set of objects retrieved from the ith local
database. We have the constraint that the total number of
objects retrieved from the local databases is fixed:

    Σ_{i=1}^{n} |V_Q^i| = ck,

where c is a constant larger than 1, k is the number of
relevant objects that a user wants to retrieve, n is the
number of local databases, and |S| is the number of elements
of a set S. The objectives of the collection fusion problem
under this constraint are as follows:

(1) The ratio of retrieved objects among the relevant objects
should be maximized. That is,

    maximize  Σ_{i=1}^{n} |R_Q^i ∩ V_Q^i| / Σ_{i=1}^{n} |R_Q^i|

    subject to the constraint  Σ_{i=1}^{n} |V_Q^i| = ck.

(2) The ratio of irrelevant objects among the retrieved objects
should be minimized. That is,

    minimize  Σ_{i=1}^{n} |I_Q^i ∩ V_Q^i| / Σ_{i=1}^{n} |V_Q^i|

    subject to the constraint  Σ_{i=1}^{n} |V_Q^i| = ck.

Objective (1) maximizes the recall, and objective (2) maximizes
the precision, because the precision is

    1 − Σ_{i=1}^{n} |I_Q^i ∩ V_Q^i| / Σ_{i=1}^{n} |V_Q^i|.
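These objectives can be evaluated directly for a merged result. The following is a minimal sketch (in Python, chosen here for illustration; the object IDs are hypothetical) of the recall and precision as defined above:

```python
# Recall and precision of a distributed retrieval, following the
# definitions above. relevant[i], irrelevant[i], retrieved[i] stand
# for the sets R_Q^i, I_Q^i, V_Q^i of the ith local database.

def recall(relevant, retrieved):
    # fraction of all relevant objects that were retrieved
    hits = sum(len(r & v) for r, v in zip(relevant, retrieved))
    total_relevant = sum(len(r) for r in relevant)
    return hits / total_relevant

def precision(irrelevant, retrieved):
    # 1 minus the fraction of retrieved objects that are irrelevant
    misses = sum(len(irr & v) for irr, v in zip(irrelevant, retrieved))
    total_retrieved = sum(len(v) for v in retrieved)
    return 1 - misses / total_retrieved

relevant   = [{1, 2}, {5}]        # R_Q^1, R_Q^2 (hypothetical IDs)
irrelevant = [{3, 4}, {6, 7}]     # I_Q^1, I_Q^2
retrieved  = [{1, 3}, {5, 6}]     # V_Q^1, V_Q^2
print(recall(relevant, retrieved))      # 2 of 3 relevant objects found
print(precision(irrelevant, retrieved)) # 2 of 4 retrieved are relevant
```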
Since we assume that servers are independent and
autonomous, their similarity measures may be different from each other. Therefore the similarity value by
a local similarity measure between an object from a local database and a query object may be different from
that by a global similarity measure at the global server.
There are many similarity measures for content-based image
retrieval, and some of them are correlated with one another.
To show such cases, we present examples using the RGB color
and the RGB texture.

Fig. 1. Scatter diagram for the average RGB color 4 × 4 and the
average RGB color 5 × 5 in the case that arbitrary pairs among
images are chosen.
Example 1. Let the global server and a local database both
support image similarity search using color. The global server
extracts average RGB color features for the 6 × 6 subimages of
an image and measures the similarity value against a query image
using the inter-feature normalization described in MARS [6].
The local database extracts average RGB color features for the
4 × 4 subimages of an image and measures the similarity value
as the global server does. From 3016 images, 3000 arbitrary
pairs of images are selected. For each pair, the local similarity
value x and the global similarity value y are measured. The
scatter diagram of the (x, y) values for the 3000 selected pairs
is shown in Fig. 1. In this case, the diagram shows the shape
of a straight line.
Example 2. In Fig. 2, the similarity values of the y
coordinate are obtained using the average RGB color
of 5 × 5 subimages while those of the x coordinate are
obtained using the RGB texture of 6 × 6 subimages.
Contrary to the previous case, the scatter diagram
does not show any relationship between two similarity
measures with different attributes.
Although similarity measures are different between
the global server and local databases, we observed
that the scatter diagram of similarity values of some
pairs of similarity measures showed the shape of a
straight line.

Fig. 2. Scatter diagram for the average RGB color 5 × 5 and the
RGB texture 6 × 6 in the case that arbitrary pairs among images
are chosen.

Since the linear relationship cannot be proved formally, we
instead conducted extensive experiments that demonstrate it.
Table 1 shows three groups of
features, that is, RGB colors, RGB textures and RGB
colors & textures, to be used for similarity measures.
We used the inter-feature normalization described
in MARS [6] to calculate similarity values. The
statistical linear regression method is used to obtain
the equation of a straight line and the test of statistical
hypothesis is used to verify the linear relationship
between two similarity measures. As test indicators, we used
the scatter diagram, the sample coefficient of determination
r², and the analysis of variance (F0, F(α)), where r² is given
by (sum of squares due to linear regression)/(total sum of
squares), F0 is given by (mean square due to linear
regression)/(mean square of residual), and F(α) is obtained
from the F-distribution for a level of significance α. If the
linear regression model is effective for two similarity
measures, the scatter diagram should show the shape of a
straight line, r² (0 < r² < 1) should be near 1, and F0 should
be larger than F(α) [9,10]. Table 2 shows the results of the
experiments for pairs of similarity measures. For similarity
measures from the same group, the scatter diagram shows the
shape of a straight line, the r² value is near 1, and F0 is
much larger than F(α). For similarity measures from different
groups, the scatter diagram does not show the shape of a
straight line and the r² value is near 0. Although F0 still
exceeds F(α) in this case, it is much smaller than the F0
values observed when the linear relationship holds. We can
therefore say that such pairs of similarity measures do not
satisfy the linear relationship.
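The test indicators above can be reproduced with ordinary least squares. The sketch below (pure Python; the data and the name `regression_indicators` are chosen for illustration, not taken from the paper) fits ŷ = a + bx and computes r² and F0 exactly as defined above:

```python
# Simple linear regression indicators: fit y = a + b*x, then compute
# r^2 = SSR/SST and F0 = (mean square due to regression)/(mean square
# of residual), with 1 and n-2 degrees of freedom.

def regression_indicators(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx                 # slope (beta-hat)
    a = my - b * mx               # intercept (alpha-hat)
    sst = sum((y - my) ** 2 for y in ys)   # total sum of squares
    ssr = b * sxy                          # sum of squares due to regression
    sse = sst - ssr                        # residual sum of squares
    r2 = ssr / sst
    f0 = ssr / (sse / (n - 2))    # MSR / MSE
    return a, b, r2, f0

# Synthetic near-linear pair of similarity values (illustrative only).
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
ys = [0.12, 0.19, 0.31, 0.42, 0.48, 0.61]
a, b, r2, f0 = regression_indicators(xs, ys)
# A near-linear pair gives r2 close to 1 and a large F0; the linearity
# test compares f0 against F(alpha) from an F-table.
```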
Table 1
The description of three groups of features to be used for similarity measures

Group                   Feature name   Feature description
RGB color features      feat1          average RGB color feature for 2 × 2 subimages
                        feat2          average RGB color feature for 3 × 3 subimages
                        feat3          average RGB color feature for 4 × 4 subimages
                        feat4          average RGB color feature for 5 × 5 subimages
                        feat5          average RGB color feature for 6 × 6 subimages
RGB texture features    feat6          average RGB texture feature for 2 × 2 subimages
                        feat7          average RGB texture feature for 3 × 3 subimages
                        feat8          average RGB texture feature for 4 × 4 subimages
                        feat9          average RGB texture feature for 5 × 5 subimages
                        feat10         average RGB texture feature for 6 × 6 subimages
RGB color & texture     feat11         average RGB color & texture feature for 2 × 2 subimages
                        feat12         average RGB color & texture feature for 3 × 3 subimages
                        feat13         average RGB color & texture feature for 4 × 4 subimages
                        feat14         average RGB color & texture feature for 5 × 5 subimages
Table 2
Test of statistical hypothesis for the linear relationship between two similarity measures

Features used for        Scatter        Correlation
similarity measures      diagram        ρ          r²      F0       F(0.05)   Result
feat1 : feat2            straight line  0.9847     0.970   96566    0.000     linear
feat1 : feat4            straight line  0.964      0.923   36327    0.000     linear
feat6 : feat8            straight line  0.9604     0.922   35839    0.000     linear
feat3 : feat5            straight line  0.9969     0.994   490367   0.000     linear
feat8 : feat10           straight line  0.9926     0.985   202061   0.000     linear
feat11 : feat14          straight line  0.9677     0.937   44451    0.000     linear
feat7 : feat9            straight line  0.9842     0.969   93264    0.000     linear
feat12 : feat13          straight line  0.9960     0.992   377866   0.000     linear
feat1 : feat9            scattered      0.0708     0.005   15.16    0.000     nonlinear
feat6 : feat12           scattered      0.1319     0.017   53.37    0.000     nonlinear
feat5 : feat10           scattered      0.0563     0.003   9.564    0.002     nonlinear
feat1 : feat10           scattered      0.0513     0.003   7.945    0.005     nonlinear
For any two similarity measures, if they satisfy the linear
relationship, we can use that property for the distributed
similarity search.
3. Distributed similarity search algorithm

The distributed similarity search algorithm retrieves the k
most similar objects under a global similarity measure from n
local databases LDi, i = 1, . . . , n. The algorithm must achieve
high recall and high precision to meet the objectives of the
collection fusion problem stated in Section 2. We suggest two
heuristic algorithms and an algorithm using linear regression
for distributed similarity search. Table 3 shows the parameters
used in the algorithms.

Table 3
Parameters used in the algorithms

q     query object of distributed similarity search
k     number of objects to find
c     multiplication ratio (>1) when more than k objects are retrieved
n     number of local databases
r     number of retrievals for one local database
LDi   ith local database
pi    number of objects to be retrieved from LDi in one step

Heuristic Algorithm (q, c, k, n, LD1, . . . , LDn)
(1) send a query object q to all LDs
(2) for each LDi, initialize pi
(3) while (number of retrieved objects < ck)
(4)   for each LDi, get_more_objects(q, pi, LDi);
      let resulti be the set of objects retrieved from LDi
(5)   merge_results(result1, . . . , resultn)
(6)   for each LDi, recalculate pi using the heuristic estimator of LDi
(7) endwhile

Here, merge_results(result1, . . . , resultn) merges and ranks
the results retrieved from all local databases using a global
similarity measure, and get_more_objects(q, pi, LDi) requests
LDi to return pi more objects similar to the query q using the
local similarity measure of LDi, as described in [7].

If the global server retrieves exactly k objects from the
local databases, the recall will be less than 1 because some
of the retrieved objects will be irrelevant. Therefore the
global server must get more than k, that is, ck (c > 1)
objects. The precision, however, decreases as c increases:
the recall has a tradeoff relation with the precision. If all
local databases showed the same recall and the same precision,
it would be sufficient for the global server to get pi = [ck/n]
objects only once from each local database, where [ ] is the
rounding operator. However, these values differ for each LDi
and cannot be known in advance, so we must refine them
repeatedly. If the number of repetitions is r, the initial
value of pi is given by pi = [ck/rn]. Step (6) of the above
algorithm assigns a large value to pi for a local database
whose heuristic estimator is high, in order to increase the
recall and the precision.

3.1. Average ranking heuristic

A heuristic estimator αi is defined as

    αi = Mi / Σ_{j=1}^{Mi} Rank_ij,

where Rank_ij is the merged rank of the jth object retrieved
from the ith local database and Mi is the number of objects
retrieved in the last retrieval from the ith local database.
This value is the reciprocal of the average merged rank of the
objects retrieved from the ith local database. The global
server gets more objects from a local database with a high
value of αi and fewer objects from one with a low value of αi.
The pi of the heuristic algorithm is given by

    pi = (k/r) · αi / (α1 + · · · + αn).
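A minimal sketch of this allocation rule (Python; the merged ranks below are hypothetical, and `allocate` is a name chosen for illustration):

```python
# Average ranking heuristic: alpha_i is the reciprocal of the mean
# merged rank of the objects LD_i contributed, and p_i splits the
# per-round budget k/r in proportion to alpha_i.

def alpha(merged_ranks):
    # merged_ranks: merged ranks Rank_ij of the M_i objects from LD_i
    return len(merged_ranks) / sum(merged_ranks)

def allocate(k, r, ranks_per_db):
    alphas = [alpha(ranks) for ranks in ranks_per_db]
    total = sum(alphas)
    # p_i = (k/r) * alpha_i / (alpha_1 + ... + alpha_n), rounded
    return [round((k / r) * a / total) for a in alphas]

# LD_1's objects merged to ranks 1, 2, 4 (ranking well globally);
# LD_2's merged to ranks 3, 9, 10. LD_1 gets the larger share.
print(allocate(k=20, r=2, ranks_per_db=[[1, 2, 4], [3, 9, 10]]))
# prints [8, 2]
```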
3.2. Average global similarity heuristic

This heuristic is similar to the average ranking heuristic.
A rank is an integer value with a uniform difference between
adjacent ranked objects; the similarity difference between
adjacent objects, however, may not be uniform. So the heuristic
estimator βi is defined as

    βi = (Σ_{j=1}^{Mi} Global_Similarity_ij) / Mi.

This value is the average global similarity of the objects
retrieved from the ith local database. The pi of the heuristic
algorithm is given by

    pi = (k/r) · βi / (β1 + · · · + βn).
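The same allocation step with the similarity-based estimator can be sketched as follows (Python; the global similarity values are hypothetical):

```python
# Average global similarity heuristic: beta_i is the mean global
# similarity of the objects LD_i returned, so databases whose objects
# score well globally receive a larger share of the budget k/r.

def beta(global_sims):
    # global_sims: Global_Similarity_ij of the M_i objects from LD_i
    return sum(global_sims) / len(global_sims)

def allocate(k, r, sims_per_db):
    betas = [beta(s) for s in sims_per_db]
    total = sum(betas)
    # p_i = (k/r) * beta_i / (beta_1 + ... + beta_n)
    return [(k / r) * b / total for b in betas]

# LD_1 returned objects with global similarities 0.9, 0.8;
# LD_2 returned objects with global similarities 0.3, 0.2.
print(allocate(k=10, r=1, sims_per_db=[[0.9, 0.8], [0.3, 0.2]]))
```

Unlike the rank-based estimator, βi weights databases by how similar their objects actually are under the global measure, not merely by their ordering in the merged list.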
3.3. Distributed similarity search algorithm using the linear
regression

In Section 2, we observed that there exist similarity measures
that have a linear relationship between them. For these cases,
we can apply the linear regression analysis to the distributed
similarity search. The linear equation ŷ = α̂ + β̂x of the
straight line in a scatter diagram is obtained using the linear
regression analysis. The algorithm retrieves a predefined
number p of objects from each local database and analyzes the
retrieved objects to find the linear equation and the global
threshold (GT) corresponding to the local threshold (LT). The
smallest local similarity value among the retrieved objects
becomes the local threshold. The algorithm uses three different
global thresholds, gt^u, gt^m, gt^l, corresponding to the local
threshold. In Fig. 3, gt^m is the y-coordinate of the
intersection of ŷ = α̂ + β̂x and x = LT; gt^u is that of the
intersection of ŷ = α̂ + β̂x + dy (dy is the half-width of the
100(1 − δ)% confidence interval of y) and x = LT; and gt^l is
that of the intersection of ŷ = α̂ + β̂x − dy and x = LT. T
indicates the type of the global threshold, one of gt^u, gt^m,
gt^l. The algorithm selects the local database that has the
largest global threshold and retrieves objects from the
selected database next.

Algorithm (p, c, k, n, q, T, LD1, . . . , LDn)
(1) for each LDi, i = 1, . . . , n, get_more_objects(q, p, LDi)
(2) for each LDi, i = 1, . . . , n, analyze the objects retrieved from LDi using the linear regression analysis,
    obtain the equation ŷi = α̂i + β̂i xi, and obtain gti, where gti is one of gti^u, gti^m, gti^l according to T
(3) let LDl be the local database which has the largest GT among all local databases, and let its GT be gtl
(4) if (the total number of retrieved objects with a global similarity value greater than gtl) > k or
    (the total number of retrieved objects) > ck then stop
(5) select the LDl which has the largest GT among all local databases
(6) get_more_objects(q, p, LDl)
(7) analyze the objects from LDl using the linear regression
(8) goto step (3)
The above algorithm uses one of the three global thresholds
gt^u, gt^m, gt^l. With gt^u, the recall of the result is high
and the precision is low. With gt^m, the recall is lower than
with gt^u while the precision is higher. With gt^l, the recall
is the lowest and the precision is the highest among the three
cases.
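Assuming the regression fit and the confidence half-width dy are already available, the derivation of the three thresholds can be sketched as follows (Python; all values and the name `global_thresholds` are hypothetical):

```python
# Three global thresholds from the fitted line y_hat = a_hat + b_hat*x
# at the local threshold LT: gt^m lies on the fitted line, and gt^u /
# gt^l shift it up / down by dy, the half-width of the confidence
# interval of y.

def global_thresholds(a_hat, b_hat, lt, dy):
    gt_m = a_hat + b_hat * lt          # intersection with x = LT
    return {"u": gt_m + dy, "m": gt_m, "l": gt_m - dy}

# Hypothetical fit: y_hat = 0.02 + 0.95*x, LT = 0.6, dy = 0.05.
gts = global_thresholds(a_hat=0.02, b_hat=0.95, lt=0.6, dy=0.05)
# The server next polls the local database whose chosen threshold
# (gt^u, gt^m, or gt^l, per the parameter T) is the largest.
print(gts)
```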
Fig. 3. Three global thresholds corresponding to the local threshold.
4. Experiment
In order to measure the effectiveness and performance of the
proposed distributed similarity search algorithms, we conducted
comprehensive experiments in an environment containing a large
number of images and various queries. The test data consist of
3016 images with 256 RGB color bitmaps. The contents of the
test images are shown in Table 4. To show the preciseness of
the linear regression on partly retrieved objects, we present
experimental results in Table 5, indicating that the partial
results gradually approach the final result. The features used
for the similarity measures are chosen from the RGB color
group. As the number of retrieved objects increases, r², α,
and β approach their final values.
We evaluated the effectiveness of the algorithm using the precision, the recall, and the combined metric that is the product of the recall and the precision.
The combined metric can measure the overall effec-
Fig. 4. The precision and the recall of each algorithm in the clustered distribution.
Table 4
The contents of test images

Category       # of images   Area
plants         720           flower, leaves, grass
pattern        680           glass, brick, woods
architecture   820           house, building
scene          796           water, sky, cloud
Fig. 5. P × R of each algorithm in the clustered distribution.
Table 5
The preciseness of the linear regression of partly retrieved objects

# of retrieved objects   MSE*           r²       α        β
70                       7.71 × 10⁻⁵    0.7069   −0.071   1.086
110                      8.82 × 10⁻⁵    0.7060   0.002    0.992
190                      7.16 × 10⁻⁵    0.8053   0.045    0.936
230                      6.78 × 10⁻⁵    0.8485   0.025    0.971
299                      6.62 × 10⁻⁵    0.8832   0.019    0.962
Total objects            5.12 × 10⁻⁵    0.9836   0.013    0.959

* MSE (mean square error) is (residual sum of squares)/(number
of retrieved objects).
tiveness. We made 10 queries for each test using various parameters and averaged their results.
We assume four local databases with one global
server, where the images are distributed over local
databases. To allocate images to these local databases,
we use two approaches:
(1) random allocation and
(2) clustered allocation.
In the random allocation, all images are distributed
randomly into the four databases. In the clustered allocation,
similar images are likely to be allocated to the same local
database.

Fig. 6. P × R of each algorithm in the random distribution.

An equal number of images
are allocated to each local database. Clusters are generated with centers randomly distributed. Each local
database contains 4 to 5 clusters. About 60% of data
are allocated to clusters, while the rest are distributed
randomly. These two cases are evaluated respectively.
Other test parameter values are a 99.9% confidence level for
estimating the confidence interval of y, the upper type gt^u
of the global threshold, 1.2 for the value of c, and 10 for p
and the initial pi.
The graphs of the precision, the recall, and their
combined metric P × R for three algorithms are
summarized in Figs. 4–6. For the clustered distribution, the algorithm using the linear regression outperforms the average ranking heuristic algorithm (alpha)
and the average global similarity heuristic algorithm
(beta), because the algorithm using the linear regression
(linear) reflects the clustering effect of the data
distribution well. For the random distribution, the algorithm
using the linear regression shows slightly better results than
the other algorithms. In a real situation, however, the data
distributions of databases on the Web are generally clustered.
Therefore, the algorithm using the linear regression is the
more practical choice.
5. Conclusion
In this paper, we proposed novel distributed similarity search
algorithms that solve the collection fusion problem for
multimedia databases in a distributed heterogeneous environment
such as the Web. Experiments show that the algorithm using the
linear regression performs best. To our knowledge, we are the
first to study the collection fusion problem for distributed
heterogeneous multimedia databases and to present algorithms as
solutions. Search over multimedia databases on the Web is
becoming a very important issue, so the algorithms proposed in
this paper can be the basis for future research in this area.
References
[1] J. Callan, Z. Lu, W. Croft, Searching distributed collections
with inference networks, in: Proc. 18th Annual Internat.
ACM/SIGIR Conference, 1995, pp. 21–28.
[2] W. Chang, G. Sheikholeslami, J. Wang, A. Zhang, Data
resource selection in distributed visual information systems,
IEEE Trans. Knowledge Data Engrg. 10 (6) (1998) 926–946.
[3] L. Gravano, H. Garcia-Molina, Merging ranks from heterogeneous internet sources, in: Proc. 23rd Internat. Conf. on Very
Large Data Bases, August 1997, pp. 14–25.
[4] J.H. Lee, D.H. Kim, C.W. Chung, Multi-dimensional selectivity estimation using compressed histogram information, in:
Proc. ACM SIGMOD Internat. Conf. on Management of Data,
June 1999, pp. 205–214.
[5] W. Meng, K.L. Liu, C. Yu, X. Wang, Y. Chang, N. Rishe,
Determining text databases to search in the Internet, in: Proc.
Internat. Conf. on Very Large Data Bases, August 1998,
pp. 14–25.
[6] M. Ortega, K. Chakrabarti, K. Porkaew, S. Mehrotra, Supporting ranked Boolean similarity queries in MARS, IEEE
Trans. Knowledge Data Engrg. 10 (6) (1998) 905–925.
[7] T. Seidl, H. Kriegel, Optimal multi-step k-nearest neighbor
search, in: Proc. ACM SIGMOD Internat. Conf. on Management of Data, June 1998, pp. 154–165.
[8] E. Voorhees, N. Gupta, B. Johnson-Laird, The collection fusion
problem, in: Proc. 3rd Text Retrieval Conference (TREC-3),
1994, pp. 95–104.
[9] R.V. Hogg, E.A. Tanis, Probability & Statistical Inference,
MacMillan Publishing Co., New York, 1977.
[10] S.H. Park, Regression Analysis, DaeYoung Publishing Co.,
1985.