The k-club problem: new results for k=3

Maria Teresa Almeida and Filipa Duarte de Carvalho
CIO − Working Paper 3/2008
M. T. Almeida and F. D. Carvalho
Instituto Superior de Economia e Gestão, Universidade Técnica de Lisboa
CIO-Centro de Investigação Operacional-FC/UL
January 2008
Abstract: Given an undirected graph G, the k-club problem consists of finding a maximum cardinality subset of its nodes that induces a subgraph of diameter at most k. Such subsets, called k-clubs, are clique relaxations that represent dense substructures of G. They provide interesting information on cohesive subgroups in social networks that is not revealed by cliques. They are also used by biologists to study protein interactions.
The k-club problem is NP-hard for any fixed k.
In this paper we present a new integer linear formulation for the k = 3 case and derive
new classes of valid inequalities to strengthen its LP relaxation.
Keywords: k-club; integer programming; valid inequalities; clique; social networks
1 Introduction
Social and behavioral scientists use network representations to study linkages between groups or individuals in societies and organizations. Biologists use them to study interactions between proteins. In such studies it is important to identify dense structures, i.e., subsets of nodes with a high density of interconnections. The highest density structure is the well-known clique [3], but it is considered overly restrictive in those contexts. Clique relaxations such as k-cliques and k-clubs represent cohesive subgroups and provide interesting information not revealed by cliques. A discussion of k-clubs and k-cliques in the context of social networks can be found in [1] and [7]. For a discussion in the context of biological networks the reader is referred to [2].
Given an undirected graph G, a k-club is a subset of nodes of G that induces a subgraph of diameter at most k. For k = 1, k-clubs and k-cliques both coincide with cliques. For k > 1, every k-club is a k-clique, but the converse is not true. A k-clique is a set S of nodes such that every pair is linked by a chain of length at most k in G, but not necessarily in the subgraph induced by S. For S to be a k-club, nodes outside S may not be used to link pairs of nodes in S. A classical example that illustrates the difference between k-clubs and k-cliques is given in [1]. The k-club problem consists of finding a maximum cardinality k-club of a graph. The k-club problem was proven to be NP-hard by Bourjolly et al. [4].
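The distinction between the two definitions can be checked computationally. The Python sketch below is an illustration only: the helper names (`hop_distances`, `is_k_clique`, `is_k_club`) and the 5-cycle used as input are assumptions made for this example, and the graph is not the example from [1]. Distances are computed by breadth-first search, in the whole graph for the k-clique test and inside the induced subgraph for the k-club test.

```python
from collections import deque


def hop_distances(adj, source, allowed=None):
    """BFS distances from `source`; if `allowed` is given, never leave it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist and (allowed is None or v in allowed):
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def _all_pairs_within(adj, nodes, k, allowed):
    """True if every pair of `nodes` is at distance <= k (unreachable = inf)."""
    for u in nodes:
        dist = hop_distances(adj, u, allowed)
        if any(dist.get(v, float("inf")) > k for v in nodes):
            return False
    return True


def is_k_clique(adj, nodes, k):
    """Distances are measured in the whole graph G."""
    return _all_pairs_within(adj, set(nodes), k, allowed=None)


def is_k_club(adj, nodes, k):
    """Distances are measured in the subgraph induced by `nodes`."""
    nodes = set(nodes)
    return _all_pairs_within(adj, nodes, k, allowed=nodes)


# Hypothetical example: the cycle 1-2-3-4-5-1 and the subset S = {1, 2, 3, 4}.
cycle = {1: [2, 5], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 1]}
S = {1, 2, 3, 4}
print(is_k_clique(cycle, S, 2))  # True: nodes 1 and 4 are linked through node 5
print(is_k_club(cycle, S, 2))    # False: inside S, nodes 1 and 4 are 3 edges apart
```

In this example S is a 2-clique only because node 5, which lies outside S, links nodes 1 and 4.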
Bourjolly et al. [5] stated some properties of k-clubs and developed three heuristic procedures for the identification of large cardinality k-clubs. In [4] the k-club problem was formulated as an integer linear program and a simplified version of it for the case k = 2 was presented. An exact enumerative algorithm was also developed to solve the k-club problem. Balasundaram et al. [2] studied the 2-club polytope and established some polyhedral results [8]. More recently, Carvalho and Almeida [6] presented new families of valid inequalities for the 2-club polytope and derived conditions for them to define facets.
This paper is organized as follows. In section 2 we introduce definitions and notation, and in section 3 we review the integer linear programming formulation for the k-club problem presented in [4]. In section 4 we show that the 3-club problem is polynomially solvable in a special class of graphs, and present a new integer linear programming formulation followed by new classes of valid inequalities.
2 Definitions and Notation
For each node $i \in V$, the set of nodes linked to $i$ by an edge in $E$ will be represented by $N_i$ and called the set of its neighbours. The degree of node $i$, $\deg_G(i)$, is the cardinality of $N_i$.
The distance between two nodes $i$ and $j$, $\mathrm{dist}_G(i,j)$, is the number of edges in a shortest chain linking $i$ to $j$ in $G$. The diameter of $G = (V,E)$, $\mathrm{diam}(G)$, is the maximum distance between two nodes of $G$.
A subset of nodes, $I \subseteq V$, such that $\mathrm{dist}_G(i,j) > 3$ for all distinct $i, j \in I$, is called a 3-independent set.
The spacing between any two edges $e_1 = (i_1, j_1)$ and $e_2 = (i_2, j_2)$ is defined as
\[ \mathrm{spac}_G(e_1, e_2) = \max_{i,j \in \{i_1, i_2, j_1, j_2\}} \{\mathrm{dist}_G(i,j)\} \]
If $D \subseteq E$ is a set of edges such that $\mathrm{spac}_G(e,f) > 3$ for all distinct $e, f \in D$, then $D$ will be called a 3-spac set.
The spacing between a node $i$ and an edge $e = (u,v)$ is defined as
\[ \mathrm{spac}_G(i, e) = \max \{\mathrm{dist}_G(i,u), \mathrm{dist}_G(i,v)\} \]
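The distance-based definitions above translate directly into code. The following Python sketch is one possible rendering, assuming the graph is given as an adjacency dictionary; all function names are hypothetical and unreachable nodes are treated as being at infinite distance.

```python
from collections import deque
from itertools import combinations

INF = float("inf")


def all_distances(adj):
    """dist[u][v] = number of edges on a shortest chain from u to v (BFS)."""
    dist = {}
    for source in adj:
        d = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist[source] = d
    return dist


def spac_edges(dist, e1, e2):
    """Spacing between two edges: largest distance among their end nodes."""
    ends = set(e1) | set(e2)
    return max(dist[i].get(j, INF) for i in ends for j in ends)


def spac_node_edge(dist, i, e):
    """Spacing between node i and edge e = (u, v)."""
    u, v = e
    return max(dist[i].get(u, INF), dist[i].get(v, INF))


def is_3_independent(dist, nodes):
    """All pairs of distinct nodes are more than 3 edges apart."""
    return all(dist[i].get(j, INF) > 3 for i, j in combinations(nodes, 2))


def is_3_spac(dist, edges):
    """All pairs of distinct edges have spacing greater than 3."""
    return all(spac_edges(dist, e, f) > 3 for e, f in combinations(edges, 2))


# Illustrative input: the path 1-2-...-8.
path = {v: [w for w in (v - 1, v + 1) if 1 <= w <= 8] for v in range(1, 9)}
dist = all_distances(path)
print(spac_edges(dist, (1, 2), (4, 5)))      # 4, attained by nodes 1 and 5
print(spac_node_edge(dist, 1, (4, 5)))       # 4
print(is_3_independent(dist, [1, 8]))        # True
print(is_3_spac(dist, [(1, 2), (3, 4)]))     # False: spacing is exactly 3
```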
If $S$ is a subset of nodes of $G = (V,E)$, the subgraph induced by $S$ will be denoted $G[S]$ and the set of edges in $E$ with both ends in $S$ will be represented by $E(S,S)$.
If $\mathrm{diam}(G) \le k$ then the optimal solution of the k-club problem is the whole set $V$ and the problem is trivial. Throughout the remainder of the paper we will assume that $\mathrm{diam}(G) > k$.
3 Chain Formulation for the k-Club Problem [4]
A pair of nodes, $i$ and $j$, may belong to a k-club in $G = (V,E)$ if and only if there is a chain of length at most $k$ linking $i$ and $j$ such that every node in the chain belongs to the k-club.
The k-club problem was formulated as an integer linear problem by Bourjolly et al. [4] as follows.
For any two nodes $i, j \in V$, let $C_{ij}^k$ be the set of all chains of length at most $k$ linking $i$ and $j$ and denote by $V_t$ the vertex set of a chain $t$. For every $i \in V$, let $x_i$ be a binary variable equal to 1 if and only if $i$ belongs to the solution and let $y_t$ be a binary variable associated with chain $t \in C$, $C = \bigcup_{i,j \in V} C_{ij}^k$.
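CHAIN FORMULATION
\[
\begin{array}{lll}
\max & Z = \sum_{i \in V} x_i & \\
\text{s.t.} & x_i + x_j \le 1 & \text{if } (i,j) \notin E \text{ and } C_{ij}^k = \emptyset \\
 & x_i + x_j \le 1 + \sum_{t \in C_{ij}^k} y_t & \text{if } (i,j) \notin E \text{ and } C_{ij}^k \ne \emptyset \\
 & y_t \le x_r & \text{for all } t \in C \text{ and } r \in V_t \\
 & x_i \in \{0,1\} & i \in V \\
 & y_t \in \{0,1\} & t \in C
\end{array}
\]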
Note that, along with a variable associated with every node of graph G, there is one
variable associated with every chain, with at most k edges, linking each pair of nonadjacent
nodes.
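To get a feel for how the number of chain variables compares with the number of edges, the chains can be enumerated on a small instance. The sketch below is illustrative only: the function names and the complete bipartite graph $K_{3,3}$ used as input are assumptions, and the enumeration is a plain depth-first search that is practical only for small graphs.

```python
def chains_between(adj, i, j, k):
    """Count simple chains with at most k edges linking i and j."""
    def extend(last, visited, edges_left):
        if last == j:
            return 1
        if edges_left == 0:
            return 0
        return sum(extend(n, visited | {n}, edges_left - 1)
                   for n in adj[last] if n not in visited)
    return extend(i, {i}, k)


def chain_variable_count(adj, k):
    """Number of y_t variables: chains linking pairs of nonadjacent nodes."""
    nodes = sorted(adj)
    return sum(chains_between(adj, i, j, k)
               for pos, i in enumerate(nodes)
               for j in nodes[pos + 1:]
               if j not in adj[i])


# Hypothetical instance: K_{3,3} with sides {0, 1, 2} and {3, 4, 5}.
k33 = {0: [3, 4, 5], 1: [3, 4, 5], 2: [3, 4, 5],
       3: [0, 1, 2], 4: [0, 1, 2], 5: [0, 1, 2]}
num_chains = chain_variable_count(k33, 3)
num_edges = sum(len(nbrs) for nbrs in k33.values()) // 2
print(num_chains, num_edges)  # 18 9
```

On this instance each of the 6 nonadjacent pairs is linked by 3 chains of two edges, so the chain formulation needs 18 chain variables while the graph has only 9 edges.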
For $k = 2$, each chain $t \in C$ has a single internal node that can be used to represent the chain. In this case the $y_t$ variables are not needed, as shown in [4]. For $k = 3$, we present in section 4.2 an alternative formulation with variables associated with the edges, rather than with the chains of length at most $k$, which reduces the total number of variables needed in the model.
4 The 3-Club Problem
This section is devoted to the 3-club problem. In section 4.1 we show that the problem is polynomially solvable in a special class of graphs. In section 4.2 we present a new integer linear programming model alternative to the chain formulation. In sections 4.3 and 4.4 we present new valid inequalities.
4.1 A polynomial case
If $G$ is a tree, any feasible solution for the 3-club problem must be the node set of a subtree with diameter at most 3. A subtree with diameter exactly 3 contains at least one chain with three edges, and the number of nodes in a maximal such subtree is given by the sum of the degrees in $G$ of that chain's internal nodes. In this case, to determine an optimal solution for the 3-club problem one only needs to identify an edge $e^* = (i^*, j^*) \in E$ such that
\[ e^* = \arg\max \{\deg_G(i) + \deg_G(j) : e = (i,j) \in E\} \]
The corresponding 3-club is $N_{i^*} \cup N_{j^*}$.
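The tree case can be sketched in a few lines of Python. The function name and the sample tree below are hypothetical; the sketch assumes the input is a tree with integer node labels and diameter greater than 3, as in the rest of the paper, and returns the union of the neighbourhoods of the end nodes of a degree-sum-maximizing edge.

```python
def max_3club_in_tree(adj):
    """Return a maximum 3-club of a tree given as an adjacency dictionary."""
    edges = [(i, j) for i in adj for j in adj[i] if i < j]
    # Edge whose end nodes have the largest degree sum.
    i_star, j_star = max(edges, key=lambda e: len(adj[e[0]]) + len(adj[e[1]]))
    # i_star is in N(j_star) and vice versa, so the union contains both ends.
    return set(adj[i_star]) | set(adj[j_star])


# Hypothetical tree with 9 nodes and diameter 5.
tree = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2, 5, 6, 7],
        5: [4], 6: [4], 7: [4, 8], 8: [7, 9], 9: [8]}
club = max_3club_in_tree(tree)
print(sorted(club))  # [1, 2, 3, 4, 5, 6, 7]
```

Here edge (2, 4) has degree sum 3 + 4 = 7, which is also the size of the returned 3-club.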
4.2 Neighbourhood formulation
The chain formulation for the 3-club problem may have a very large number of variables, due to the number of chains in $C = \bigcup_{i,j \in V} C_{ij}^3$ it has to deal with.
To reduce the number of variables, one may interpret a 3-club as a subset $S$ of nodes such that for any pair $i, j \in S$, at least one of the following conditions holds:
(i) $i$ and $j$ are linked by an edge $(i,j) \in E$;
(ii) there is a node $r$ in $S$ linked to nodes $i$ and $j$, i.e., such that $(i,r) \in E$ and $(r,j) \in E$;
(iii) there are in $S$ two nodes, $p$ and $q$, such that $p$ is a neighbour of $i$, $q$ is a neighbour of $j$ and $(p,q) \in E$.
Let us associate a binary variable $x_i$ with each node $i \in V$ and a variable $z_{ij}$ with each edge $(i,j) \in E$. A 3-club in $G = (V,E)$ is a subset of nodes represented by a point in the subset of $\mathbb{Z}^{|V|+|E|}$ defined by
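\[ x_i + x_j \le 1, \qquad i, j \in V : \mathrm{dist}_G(i,j) > 3 \tag{1} \]
\[ x_i + x_j \le 1 + \sum_{r \in (N_i \cap N_j)} x_r + \sum_{\substack{p \in N_i,\ q \in N_j \\ (p,q) \in E}} z_{pq}, \qquad i, j \in V : (i,j) \notin E,\ \mathrm{dist}_G(i,j) \le 3 \tag{2'} \]
\[ z_{ij} \le x_i, \qquad (i,j) \in E \tag{3} \]
\[ z_{ij} \le x_j, \qquad (i,j) \in E \tag{4} \]
\[ x_i \in \{0,1\}, \qquad i \in V \tag{5} \]
\[ z_{ij} \in \{0,1\}, \qquad (i,j) \in E \tag{6} \]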
Constraints (1) state that, if the distance between two nodes $i$ and $j$ is greater than 3, then at most one of the nodes $i$ or $j$ may belong to a 3-club. Constraints (2') impose that two nonadjacent nodes $i$ and $j$ may not be both in a 3-club unless either a common neighbour is in the 3-club or a pair of neighbours $p$ and $q$ of $i$ and $j$, respectively, linked by an edge are in the 3-club. Constraints (3) and (4) guarantee that the end nodes of an edge $(i,j)$ are both in a 3-club whenever the corresponding edge variable $z_{ij}$ is equal to 1.
Whenever $x_p = x_q = 1$ and $(p,q) \in E$ the edge variable $z_{pq}$ may be either 0 or 1, unless there are two other nodes, a node $i \in N_p \setminus N_q$ and a node $j \in N_q \setminus N_p$, such that $x_i = x_j = 1$ and $i - p - q - j$ is the only way of linking $i$ and $j$ with no more than 3 edges. As a consequence, a 3-club in $G$ may be represented by more than one point. To obtain a one-to-one representation, constraints
\[ z_{ij} \ge x_i + x_j - 1, \qquad (i,j) \in E \tag{7} \]
will be included in the formulation.
On the other hand, by conditions (3)-(4), if $z_{pq} = 1$ then $x_p = x_q = 1$. Therefore, in conditions (2'), if either $p$ or $q$ is in $(N_i \cap N_j)$ variable $z_{pq}$ plays no role and may be dropped. Dropping it reduces the density of the coefficient matrix and strengthens the formulation from an LP point of view.
Conditions (2') will be replaced with
\[ x_i + x_j \le 1 + \sum_{r \in (N_i \cap N_j)} x_r + \sum_{(p,q) \in E_{ij}} z_{pq}, \qquad i, j \in V : (i,j) \notin E,\ \mathrm{dist}_G(i,j) \le 3 \tag{2} \]
where $E_{ij} = \{(p,q) \in E : p \in (N_i \setminus N_j),\ q \in (N_j \setminus N_i)\}$.
Defining
\[ x_i = \begin{cases} 1, & \text{if node } i \text{ is in the 3-club} \\ 0, & \text{otherwise} \end{cases} \]
\[ z_{ij} = \begin{cases} 1, & \text{if edge } (i,j) \text{ links nodes in the 3-club} \\ 0, & \text{otherwise} \end{cases} \]
a maximum cardinality 3-club is an optimal solution of the following integer problem
NEIGHBOURHOOD FORMULATION
\[ (N) \qquad \max\ Z = \sum_{i \in V} x_i \qquad \text{s.t. } (1)-(7) \]
In the remainder of the paper, constraints (1) will be called node packing constraints
and constraints (2) will be called neighbourhood constraints.
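As a sanity check on the formulation, the rows of (1) and (2) can be generated for a small graph and compared against a direct test of the 3-club property. The Python sketch below is an illustration under stated assumptions: the function names and the 7-node test graph are hypothetical, no LP or IP solver is involved, and the edge variables are fixed to $z_{pq} = x_p x_q$, which is what (3), (4) and (7) force on integer points.

```python
from collections import deque
from itertools import combinations

INF = float("inf")


def hop_distances(adj, source, allowed=None):
    """BFS distances from `source`; if `allowed` is given, never leave it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist and (allowed is None or v in allowed):
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def build_rows(adj):
    """Node packing pairs (1) and neighbourhood rows (2) of formulation (N)."""
    packing, neighbourhood = [], []
    for i, j in combinations(sorted(adj), 2):
        if j in adj[i]:
            continue  # adjacent pairs need no constraint
        if hop_distances(adj, i).get(j, INF) > 3:
            packing.append((i, j))
        else:
            n_i, n_j = set(adj[i]), set(adj[j])
            e_ij = [(p, q) for p in n_i - n_j for q in n_j - n_i if q in adj[p]]
            neighbourhood.append((i, j, n_i & n_j, e_ij))
    return packing, neighbourhood


def satisfies_rows(S, rows):
    """Check (1) and (2) for the incidence vector of S with z_pq = x_p * x_q."""
    packing, neighbourhood = rows
    x = lambda v: int(v in S)
    if any(x(i) + x(j) > 1 for i, j in packing):
        return False
    for i, j, common, e_ij in neighbourhood:
        rhs = 1 + sum(x(r) for r in common) + sum(x(p) * x(q) for p, q in e_ij)
        if x(i) + x(j) > rhs:
            return False
    return True


def is_3club(adj, S):
    """Direct test: all pairs of S within 3 edges inside the induced subgraph."""
    return all(hop_distances(adj, u, S).get(v, INF) <= 3 for u in S for v in S)


# Hypothetical test graph: the cycle 1-2-3-4-5-6-1 plus a pendant node 7 on node 1.
graph = {1: [2, 6, 7], 2: [1, 3], 3: [2, 4], 4: [3, 5],
         5: [4, 6], 6: [5, 1], 7: [1]}
rows = build_rows(graph)
agree = all(satisfies_rows(set(S), rows) == is_3club(graph, set(S))
            for r in range(len(graph) + 1) for S in combinations(graph, r))
print(agree)  # True
```

The comparison runs over all 128 node subsets, so it is only meant as a check on small instances, not as a solution method.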
Setting all $x_i$ variables to 0.5 and all $z_{ij}$ variables to 0 yields a feasible solution for the linear programming relaxation of (N). The linear optimum is therefore greater than or equal to half the number of nodes in $G$. This means that the integrality gaps tend to be quite large for sparse graphs and that tighter formulations are needed to solve the 3-club problem through LP methods. In section 4.3 we deduce lifted forms of constraints (1) and (2) and in section 4.4 we derive another family of valid inequalities, the platform inequalities.
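The feasibility of the half-integral point can be verified constraint by constraint. The following worked check, added for illustration, substitutes $x_i = 1/2$ and $z_{ij} = 0$ into (1)-(4) and (7):

```latex
\begin{align*}
(1):      &\quad x_i + x_j = \tfrac12 + \tfrac12 = 1 \le 1 \\
(2):      &\quad x_i + x_j = 1 \le 1 + \tfrac12\,|N_i \cap N_j| + 0 \\
(3),(4):  &\quad z_{ij} = 0 \le \tfrac12 \\
(7):      &\quad z_{ij} = 0 \ge \tfrac12 + \tfrac12 - 1 = 0 \\
\text{objective}: &\quad Z = \sum_{i \in V} \tfrac12 = \tfrac{|V|}{2}
\end{align*}
```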
4.3 Lifted node packing and neighbourhood constraints
Given two nodes $i, j \in V$, if $\mathrm{dist}_G(i,j) > 3$ then at most one of them may be in a 3-club. This condition is imposed by the node packing constraints (1).
Node packing constraints may be generalized to include more than two nodes.
LEMMA 1
If $I \subseteq V$ is a 3-independent set, then the multi-node packing inequality
\[ \sum_{i \in I} x_i \le 1 \tag{8} \]
is valid for the 3-club problem.
Proof:
Immediate from the definition of a 3-independent set.
A rationale similar to the one used to derive inequalities (8) may be used to deduce valid inequalities over the variables associated with the edge set of $G$.
Given two edges, $(p,q)$ and $(r,s)$, if the distance between two of the nodes in $\{p,q,r,s\}$ is greater than 3, then at most three of these nodes may be in the 3-club. This leads to the inequality $z_{pq} + z_{rs} \le 1$. Inequalities of this kind may be generalized to include edge sets with more than two elements.
LEMMA 2
If $D$ is a 3-spac set, then the edge packing inequality
\[ \sum_{e \in D} z_e \le 1 \tag{9} \]
is valid for the 3-club problem.
Proof:
Immediate from the definition of a 3-spac set.
Balasundaram et al. [2] proved that, if $I$ is a maximal 2-independent set, the packing constraint $\sum_{i \in I} x_i \le 1$ defines a facet of the 2-club polytope. By contrast, a packing inequality (8) defined by a maximal 3-independent node set $I$ may be dominated, as illustrated by the following example.
Consider a graph $G$ with 8 nodes and edge set $E = \{(i, i+1) : i = 1, \ldots, 7\}$. Node set $I = \{1, 8\}$ is a maximal 3-independent set in $G$ and the corresponding node packing inequality is dominated by the valid inequality $x_1 + x_8 + z_{45} \le 1$.
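The example is small enough to be checked by enumeration. The Python sketch below (helper names are hypothetical) lists every nonempty 3-club of the 8-node path, confirms that the inequality with the extra edge term holds for all of them when the edge term is taken as $x_4 x_5$, and confirms that no node can be added to $\{1, 8\}$ without breaking 3-independence.

```python
from collections import deque
from itertools import combinations

INF = float("inf")


def hop_distances(adj, source, allowed=None):
    """BFS distances from `source`; if `allowed` is given, never leave it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist and (allowed is None or v in allowed):
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def is_3club(adj, S):
    """All pairs of S within 3 edges inside the subgraph induced by S."""
    return all(hop_distances(adj, u, S).get(v, INF) <= 3 for u in S for v in S)


path = {v: [w for w in (v - 1, v + 1) if 1 <= w <= 8] for v in range(1, 9)}

# All nonempty 3-clubs of the path (segments with at most 4 consecutive nodes).
clubs = [set(S) for r in range(1, 9) for S in combinations(path, r)
         if is_3club(path, set(S))]

# x_1 + x_8 + z_45 <= 1, with z_45 = 1 exactly when nodes 4 and 5 are both chosen.
lifted_valid = all((1 in S) + (8 in S) + (4 in S and 5 in S) <= 1 for S in clubs)

# {1, 8} is maximal: every other node is within 3 edges of node 1 or node 8.
d1, d8 = hop_distances(path, 1), hop_distances(path, 8)
maximal = all(min(d1[v], d8[v]) <= 3 for v in path if v not in (1, 8))

print(len(clubs), lifted_valid, maximal)  # 26 True True
```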
Conditions for a 3-independent set $I$ and a 3-spac set $D$ to define a valid generalized-packing inequality are presented next.
LEMMA 3
Let $I$ be a 3-independent set in $G$ and let $D$ be a 3-spac set. If $\mathrm{spac}_G(i,e) > 3$ for all $i \in I$ and all $e \in D$, then the generalized-packing inequality
\[ \sum_{i \in I} x_i + \sum_{e \in D} z_e \le 1 \tag{10} \]
is valid for the 3-club problem.
Proof:
By lemma 1, $\sum_{i \in I} x_i \le 1$. By lemma 2, $\sum_{e \in D} z_e \le 1$. If $i \in I$ and $e = (u,v) \in D$ then either $\mathrm{dist}_G(i,u) > 3$ or $\mathrm{dist}_G(i,v) > 3$. If $x_i = 1$ then $x_u x_v = 0$ and $z_{uv} = 0$. If $z_e = 1$ then $x_u = x_v = 1$ and $x_i = 0$.
Lemmas 1, 2 and 3 may be used to generate lifted versions of neighbourhood constraints,
as follows.
LEMMA 4
Let $a, b \in V$ be a pair of nodes associated with a neighbourhood constraint (2) and let $I \subseteq V$ be a 3-independent set such that $\min \{\mathrm{dist}_G(i,a), \mathrm{dist}_G(i,b)\} > 3$ for all $i \in I$. Then
\[ x_a + x_b + \sum_{i \in I} x_i \le 1 + \sum_{r \in (N_a \cap N_b)} x_r + \sum_{(p,q) \in E_{ab}} z_{pq} \tag{11} \]
is valid for the 3-club problem.
Proof:
A 3-club may contain at most one node of $I$. If it has a node of $I$ then it can include neither $a$ nor $b$.
A Chvátal-Gomory deduction of inequality (11) is obtained combining the neighbourhood constraint (2) for the pair $\{a,b\}$ and packing inequalities (8) for the 3-independent sets $(I \cup \{a\})$ and $(I \cup \{b\})$, with coefficients 0.5 and rounding.
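Written out, the deduction might proceed as follows. This is a sketch of the rounding argument; it uses the fact that the distance condition on $I$ makes both $I \cup \{a\}$ and $I \cup \{b\}$ 3-independent sets.

```latex
\begin{align*}
& x_a + x_b - \sum_{r \in (N_a \cap N_b)} x_r - \sum_{(p,q) \in E_{ab}} z_{pq} \le 1
  && \text{(2) for } \{a,b\} \\
& x_a + \sum_{i \in I} x_i \le 1
  && \text{(8) for } I \cup \{a\} \\
& x_b + \sum_{i \in I} x_i \le 1
  && \text{(8) for } I \cup \{b\} \\
\intertext{Adding the three inequalities, each with multiplier $\tfrac12$, gives}
& x_a + x_b + \sum_{i \in I} x_i
  - \tfrac12 \sum_{r \in (N_a \cap N_b)} x_r
  - \tfrac12 \sum_{(p,q) \in E_{ab}} z_{pq} \le \tfrac32 \\
\intertext{Rounding down every left-hand side coefficient
($\lfloor -\tfrac12 \rfloor = -1$) and the right-hand side
($\lfloor \tfrac32 \rfloor = 1$) gives}
& x_a + x_b + \sum_{i \in I} x_i
  - \sum_{r \in (N_a \cap N_b)} x_r
  - \sum_{(p,q) \in E_{ab}} z_{pq} \le 1
\end{align*}
```

Moving the negative terms to the right-hand side yields inequality (11).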
LEMMA 5
Let $a, b \in V$ be a pair of nodes associated with a neighbourhood constraint (2) and let $D \subseteq E$ be a 3-spac set such that $\min \{\mathrm{spac}_G(a,e), \mathrm{spac}_G(b,e)\} > 3$ for all $e \in D$. Then
\[ x_a + x_b + \sum_{e \in D} z_e \le 1 + \sum_{r \in (N_a \cap N_b)} x_r + \sum_{(p,q) \in E_{ab}} z_{pq} \tag{12} \]
is valid for the 3-club problem.
Proof:
By lemma 2, $\sum_{e \in D} z_e \le 1$. For any $e = (u,v) \in D$, if $z_e = 1$ then nodes $u$ and $v$ must be in the 3-club. But a 3-club that includes nodes $u$ and $v$ can include neither $a$ nor $b$.
A Chvátal-Gomory deduction of inequality (12) is obtained combining the neighbourhood constraint (2) for the pair $\{a,b\}$ and the generalized packing inequalities (10) for $D$ and the 3-independent sets $I = \{a\}$ and $I = \{b\}$, with coefficients 0.5 and rounding.
A more general version of lifted neighbourhood constraints may be obtained with variables representing nodes in a 3-independent set I and variables representing edges in a
3-spac set D.
PROPOSITION 1
Let $a, b \in V$ be a pair of nodes associated with a neighbourhood constraint (2), let $I$ be a 3-independent set in $G$ and let $D$ be a 3-spac set. If $\min \{\mathrm{dist}_G(i,a), \mathrm{dist}_G(i,b)\} > 3$ for all $i \in I$, $\min \{\mathrm{spac}_G(a,e), \mathrm{spac}_G(b,e)\} > 3$ for all $e \in D$ and $\mathrm{spac}_G(i,e) > 3$ for all $i \in I$ and all $e \in D$, then the generalized neighbourhood constraint for the pair $a, b \in V$
\[ x_a + x_b + \sum_{i \in I} x_i + \sum_{e \in D} z_e \le 1 + \sum_{r \in (N_a \cap N_b)} x_r + \sum_{(p,q) \in E_{ab}} z_{pq} \tag{13} \]
is valid for the 3-club problem.
Proof:
By lemma 3, $\sum_{i \in I} x_i + \sum_{e \in D} z_e \le 1$.
If $\sum_{i \in I} x_i = 1$ then $\sum_{e \in D} z_e = 0$ and (13) reduces to (11).
If $\sum_{e \in D} z_e = 1$ then $\sum_{i \in I} x_i = 0$ and (13) reduces to (12).
A Chvátal-Gomory deduction of inequality (13) is obtained combining inequalities (11), (12) and the generalized packing inequality (10) with coefficients 0.5 and rounding.
Consider an inequality (13). If there is an edge $f = (u,v) \in D$ such that $\mathrm{dist}_G(i,u) > 3$ for all nodes $i \in (I \cup \{a,b\})$ then inequality
\[ x_a + x_b + \sum_{i \in (I \cup \{u\})} x_i + \sum_{e \in (D \setminus \{f\})} z_e \le 1 + \sum_{r \in (N_a \cap N_b)} x_r + \sum_{(p,q) \in E_{ab}} z_{pq} \tag{14} \]
is also valid for the 3-club problem. As $x_u \ge z_f$, inequality (14) dominates (13).
This dominance indicates that maximal 3-independent sets shall be used in the generation of generalized neighbourhood constraints. In lemma 6 we characterize, for maximal
3-independent sets, the edges associated with ze variables that can be included in a generalized neighbourhood constraint.
LEMMA 6
Consider a pair of nodes, $a$ and $b$, a maximal 3-independent set $I$ and a 3-spac set $D$ in the conditions stated in proposition 1. For each edge $e = (u,v) \in D$ there are nodes $\alpha, \omega \in (I \cup \{a,b\})$ such that
(i) $\mathrm{dist}_G(u,\alpha) = \mathrm{dist}_G(v,\omega) = 3$
(ii) $\mathrm{dist}_G(u,\omega) = \mathrm{dist}_G(v,\alpha) = 4$
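Proof:
In the conditions of proposition 1, for all $j \in (I \cup \{a,b\})$, $\mathrm{spac}_G(j,e) > 3$, i.e., $\max \{\mathrm{dist}_G(j,u), \mathrm{dist}_G(j,v)\} \ge 4$.
As $I$ is maximal and $u, v \notin (I \cup \{a,b\})$, $\min \{\mathrm{dist}_G(j,u) : j \in (I \cup \{a,b\})\} \le 3$ and $\min \{\mathrm{dist}_G(j,v) : j \in (I \cup \{a,b\})\} \le 3$.
As $u$ and $v$ are linked by an edge of $D$, their distances to any node $j$ differ by at most 1, so a node within distance 2 of one of them would have spacing at most 3 to $e$. Hence $\min \{\mathrm{dist}_G(j,u) : j \in (I \cup \{a,b\})\} = 3$ and $\min \{\mathrm{dist}_G(j,v) : j \in (I \cup \{a,b\})\} = 3$.
Let
\[ \alpha = \arg\min \{\mathrm{dist}_G(j,u) : j \in (I \cup \{a,b\})\}, \qquad \omega = \arg\min \{\mathrm{dist}_G(j,v) : j \in (I \cup \{a,b\})\} \]
Then $\mathrm{dist}_G(u,\omega) = \mathrm{dist}_G(v,\alpha) = 4$.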
The dominance of (14) over (13) and lemma 6 suggest a two-stage procedure to obtain a generalized neighbourhood inequality. Given a pair of nodes, $a$ and $b$, associated with a neighbourhood constraint (2), in the first stage one identifies a maximum cardinality 3-independent set in the conditions of lemma 4, say $\dot{I}$; in the second stage, one selects a maximum cardinality 3-spac set in the conditions of lemma 6, for that particular set $\dot{I}$.
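A possible implementation outline of the two-stage procedure is sketched below in Python. It is an assumption-laden illustration, not the procedure of the paper: both stages are replaced by simple greedy selections, which return maximal rather than maximum cardinality sets, and the function name, the `key` ordering parameter and the 10-node path used as input are all hypothetical.

```python
from collections import deque

INF = float("inf")


def all_distances(adj):
    """dist[u][v] = number of edges on a shortest chain from u to v (BFS)."""
    dist = {}
    for source in adj:
        d = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist[source] = d
    return dist


def two_stage_lifting(adj, a, b, key=None):
    """Greedy stand-in for the two stages; `key` orders the stage-1 candidates."""
    dist = all_distances(adj)

    def d(u, v):
        return dist[u].get(v, INF)

    def spac_node_edge(i, e):
        return max(d(i, e[0]), d(i, e[1]))

    def spac_edges(e, f):
        return max(d(s, t) for s in e + f for t in e + f)

    # Stage 1: maximal 3-independent set whose nodes are far from both a and b.
    ind_set = []
    for v in sorted(adj, key=key):
        if min(d(v, a), d(v, b)) > 3 and all(d(v, i) > 3 for i in ind_set):
            ind_set.append(v)

    # Stage 2: maximal 3-spac set compatible with a, b and the stage-1 set.
    spac_set = []
    for e in sorted((u, v) for u in adj for v in adj[u] if u < v):
        if (min(spac_node_edge(a, e), spac_node_edge(b, e)) > 3
                and all(spac_node_edge(i, e) > 3 for i in ind_set)
                and all(spac_edges(e, f) > 3 for f in spac_set)):
            spac_set.append(e)
    return ind_set, spac_set


# Hypothetical input: the path 1-2-...-10 with the pair a = 1, b = 3.
path = {v: [w for w in (v - 1, v + 1) if 1 <= w <= 10] for v in range(1, 11)}
ind_set, spac_set = two_stage_lifting(path, 1, 3, key=lambda v: -v)
print(ind_set, spac_set)  # [10] [(6, 7)]
```

With these sets the generalized neighbourhood inequality for the pair $\{1, 3\}$ reads $x_1 + x_3 + x_{10} + z_{67} \le 1 + x_2$. The selected edge (6, 7) has end nodes at distances 3 and 4 from node 3 and at distances 4 and 3 from node 10, in line with lemma 6.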
If $a$ and $b$ are two nodes associated with a packing constraint (1), a similar deduction shows that, if $I$ and $D$ are a maximal 3-independent set and a 3-spac set in the conditions of proposition 1, then the inequality
\[ x_a + x_b + \sum_{i \in I} x_i + \sum_{e \in D} z_e \le 1 \tag{13'} \]
is also valid for the 3-club problem. Note that (13') is the same as (10) for the 3-independent set $(I \cup \{a,b\})$.
4.4 Platform inequalities
Each node packing constraint and each neighbourhood constraint, in formulation (N ),
imposes conditions for a pair of nodes. From them it is possible to deduce conditions for
some triplets of nodes as shown in the following proposition.
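PROPOSITION 2
Let $R = \{a,b,c\}$ be a set of three nodes in $G$ such that $E(R,R) = \emptyset$. Then the platform inequality
\[ x_a + x_b + x_c \le 1 + \sum_{t \in V} \alpha_t x_t + \sum_{(p,q) \in E} \beta_{pq} z_{pq} \tag{15} \]
with
\[ \alpha_t = \begin{cases} 2 & \text{if } t \in (N_a \cap N_b \cap N_c) \\ 1 & \text{if } t \in ((N_a \cap N_b) \cup (N_b \cap N_c) \cup (N_a \cap N_c)) \setminus (N_a \cap N_b \cap N_c) \\ 0 & \text{otherwise} \end{cases} \tag{16} \]
and
\[ \beta_{pq} = \begin{cases} 1 & \text{if } (p,q) \in (E_{ab} \cup E_{bc} \cup E_{ac}) \\ 0 & \text{otherwise} \end{cases} \tag{17} \]
is valid for the 3-club problem.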
Proof:
Nodes $a$, $b$ and $c$ may be in a 3-club, $S$, if there is in $S$ another node adjacent to all of them or if for each pair, $\{a,b\}$, $\{b,c\}$ and $\{a,c\}$, there is either a common neighbour or a pair of neighbours linked by an edge.
A Chvátal-Gomory deduction of a platform inequality is obtained combining the constraints in (N) for the pairs $\{a,b\}$, $\{b,c\}$ and $\{a,c\}$ with coefficients 0.5 and rounding.
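The coefficients of a platform inequality can be read off the neighbourhoods of the three nodes. The Python sketch below computes the nonzero coefficients of the node variables and edge variables in (15), following (16) and (17); the function name, the dictionary-based return format and the 7-node sample graph are assumptions made for this illustration.

```python
from itertools import combinations


def platform_coefficients(adj, a, b, c):
    """Nonzero coefficients of x_t and z_pq in the platform inequality for {a, b, c}."""
    nbrs = {v: set(adj[v]) for v in (a, b, c)}
    if any(v in nbrs[u] for u, v in combinations((a, b, c), 2)):
        raise ValueError("a, b and c must be pairwise nonadjacent")
    node_coef, edge_coef = {}, {}
    for u, v in combinations((a, b, c), 2):
        # Common neighbours of a pair get coefficient 1 (raised to 2 below if needed).
        for t in nbrs[u] & nbrs[v]:
            node_coef[t] = 1
        # Edges (p, q) with p adjacent to u only and q adjacent to v only.
        for p in nbrs[u] - nbrs[v]:
            for q in nbrs[v] - nbrs[u]:
                if q in adj[p]:
                    edge_coef[tuple(sorted((p, q)))] = 1
    # Nodes adjacent to all three of a, b and c get coefficient 2.
    for t in nbrs[a] & nbrs[b] & nbrs[c]:
        node_coef[t] = 2
    return node_coef, edge_coef


# Hypothetical graph: node 0 is adjacent to 1, 2 and 3; node 4 to 1 and 2;
# edge (5, 6) links a neighbour of node 1 to a neighbour of node 3.
graph = {0: [1, 2, 3], 1: [0, 4, 5], 2: [0, 4], 3: [0, 6],
         4: [1, 2], 5: [1, 6], 6: [3, 5]}
node_coef, edge_coef = platform_coefficients(graph, 1, 2, 3)
print(node_coef, edge_coef)
```

For this graph the resulting inequality is $x_1 + x_2 + x_3 \le 1 + 2x_0 + x_4 + z_{56}$.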
Note that, if every pair is associated with a node packing constraint, (15) is a multi-node packing inequality that dominates each of the three original node packing constraints (1). If only one of the pairs is associated with a neighbourhood constraint (2) then the platform inequality dominates that constraint (2).
Platform inequalities may be lifted adapting the procedure described in the previous section for the neighbourhood constraints. Lemmas 7 and 8 state results similar to those in lemmas 4 and 5 considering a triplet $\{a,b,c\}$ associated with a platform inequality instead of a pair $\{a,b\}$ associated with a neighbourhood constraint.
LEMMA 7
Let $R = \{a,b,c\}$ be a set of three nodes in $G$ such that $E(R,R) = \emptyset$ and let $I \subseteq V$ be a 3-independent set such that $\min \{\mathrm{dist}_G(i,v) : v \in \{a,b,c\}\} > 3$ for all $i \in I$. Then the inequality
\[ x_a + x_b + x_c + \sum_{i \in I} x_i \le 1 + \sum_{t \in V} \alpha_t x_t + \sum_{(p,q) \in E} \beta_{pq} z_{pq} \tag{18} \]
with $\alpha_t$ and $\beta_{pq}$ defined as in (16) and (17) is valid for the 3-club problem.
Proof:
Similar to lemma 4.
LEMMA 8
Let $R = \{a,b,c\}$ be a set of three nodes in $G$ such that $E(R,R) = \emptyset$ and let $D \subseteq E$ be a 3-spac set such that $\min \{\mathrm{spac}_G(v,e) : v \in \{a,b,c\}\} > 3$ for all $e \in D$. Then the inequality
\[ x_a + x_b + x_c + \sum_{e \in D} z_e \le 1 + \sum_{t \in V} \alpha_t x_t + \sum_{(p,q) \in E} \beta_{pq} z_{pq} \tag{19} \]
with $\alpha_t$ and $\beta_{pq}$ defined as in (16) and (17) is valid for the 3-club problem.
Proof:
Similar to lemma 5.
Again, as for the neighbourhood constraints, a more general version of lifted platform
inequalities may be obtained with variables representing nodes in a 3-independent set I
and variables representing edges in a 3-spac set D.
PROPOSITION 3
Let $R = \{a,b,c\}$ be a set of three nodes in $G$ such that $E(R,R) = \emptyset$, let $I$ be a 3-independent set in $G$ and let $D$ be a 3-spac set. If $\min \{\mathrm{dist}_G(i,v) : v \in \{a,b,c\}\} > 3$ for all $i \in I$, $\min \{\mathrm{spac}_G(v,e) : v \in \{a,b,c\}\} > 3$ for all $e \in D$ and $\mathrm{spac}_G(i,e) > 3$ for all $i \in I$ and $e \in D$, then the lifted version of the platform inequality for nodes $a, b, c \in V$
\[ x_a + x_b + x_c + \sum_{i \in I} x_i + \sum_{e \in D} z_e \le 1 + \sum_{t \in V} \alpha_t x_t + \sum_{(p,q) \in E} \beta_{pq} z_{pq} \tag{20} \]
with $\alpha_t$ and $\beta_{pq}$ defined as in (16) and (17) is valid for the 3-club problem.
Proof:
By lemma 3, $\sum_{i \in I} x_i + \sum_{e \in D} z_e \le 1$.
If $\sum_{i \in I} x_i = 1$ then $\sum_{e \in D} z_e = 0$ and (20) reduces to (18).
If $\sum_{e \in D} z_e = 1$ then $\sum_{i \in I} x_i = 0$ and (20) reduces to (19).
A Chvátal-Gomory deduction of inequalities (20) is obtained combining (18), (19) and the generalized-packing inequality (10) for $I$ and $D$ with coefficients 0.5 and rounding.
The two-stage lifting procedure described in the previous section may be adapted to generate generalized-packing inequalities, substituting $\{a,b,c\}$ for $\{a,b\}$.
References
[1] R. D. Alba, A graph-theoretic definition of a sociometric clique, Journal of Mathematical Sociology 3 (1973) 113-126.
[2] B. Balasundaram, S. Butenko, S. Trukhanov, Novel approaches for analyzing biological networks, Journal of Combinatorial Optimization 10 (2005) 23-39.
[3] I. M. Bomze, M. Budinich, P. M. Pardalos, M. Pelillo, The maximum clique problem, in D.-Z. Du and P. M. Pardalos (Eds.), Handbook of Combinatorial Optimization. Dordrecht, The Netherlands, Kluwer Academic Publishers, 1999, 1-74.
[4] J.-M. Bourjolly, G. Laporte, G. Pesant, An exact algorithm for the maximum k-club problem in an undirected graph, European Journal of Operational Research 138 (2002) 21-28.
[5] J.-M. Bourjolly, G. Laporte, G. Pesant, Heuristics for finding k-clubs in an undirected graph, Computers & Operations Research 27 (2000) 559-569.
[6] F. D. Carvalho, M. T. Almeida, Strong valid inequalities for the 2-club problem, Centro de Investigação Operacional, Working Paper 2/2008 (available at http://cio.fc.ul.pt).
[7] R. J. Mokken, Cliques, clubs and clans, Quality and Quantity 13 (1979) 161-173.
[8] G. L. Nemhauser, L. A. Wolsey, Integer and Combinatorial Optimization, John Wiley, New York, 1988.