
Massive Support Vector Regression
(via Row and Column Chunking)
David R. Musicant and O.L. Mangasarian
NIPS 99 Workshop on
Learning With Support Vectors
December 3, 1999
http://www.cs.wisc.edu/~musicant
Chunking with 1 billion nonzero elements

[Figure: two panels, Objective Value (0 to 25,000) and Tuning Set Error (8% to 20%), each plotted against Row-Column Chunk Iteration Number (0 to 2000) and Time in Days (0 to 18).]
Outline

Problem Formulation
– New formulation of Support Vector Regression (SVR)
– Theoretically close to LP formulation of Smola, Schölkopf, Rätsch
– Interpretation of perturbation parameter

Numerical Comparisons
– Speed comparisons of our method and prior formulations

Massive Regression
– Chunking methods for solving large problems
• Row chunking
• Row-column chunking

Conclusions & Future Work
Support Vector Tolerant Regression

ε-insensitive interval within which errors are tolerated
Can improve performance on testing sets by avoiding overfitting
Deriving the SVR Problem

m points in R^n, represented by an m x n matrix A.
y ∈ R^m is the vector to be approximated.

We wish to solve (e is a vector of ones):
$$ Aw + be \approx y $$

Let w be represented by the dual formulation w = A'α:
$$ AA'\alpha + be \approx y $$

This suggests replacing AA' by a general nonlinear kernel K(A,A'):
$$ K(A,A')\alpha + be \approx y $$

Measure the error by s, with a tolerance ε:
$$ -s \le K(A,A')\alpha + be - y \le s \;\;\text{(bound errors)}, \qquad e\varepsilon \le s \;\;\text{(tolerance)} $$
Deriving the SVR Problem (continued)

Add a regularization term and minimize the error with weight C > 0:
$$
\min_{(\alpha,b,s)} \;
\underbrace{\tfrac{1}{m}\|\alpha\|_1}_{\text{regularization}}
+ \underbrace{\tfrac{C}{m}\|s\|_1}_{\text{error}}
\quad \text{s.t.}\quad
-s \le K(A,A')\alpha + be - y \le s \;\;\text{(bound errors)}, \quad
e\varepsilon \le s \;\;\text{(tolerance)}
$$

Parametrically maximize the tolerance ε via a parameter μ ∈ [0, 1]. This maximizes the minimum error component, thereby resulting in error uniformity:
$$
\min_{(\alpha,b,s,\varepsilon,a)} \;
\underbrace{\tfrac{1}{m}\, e'a}_{\text{regularization}}
+ \underbrace{\tfrac{C}{m}\, e's}_{\text{error}}
- \underbrace{C\mu\varepsilon}_{\text{interval size}}
\quad \text{s.t.}\quad
-s \le K(A,A')\alpha + be - y \le s \;\;\text{(bound errors)}, \quad
e\varepsilon \le s \;\;\text{(tolerance)}, \quad
-a \le \alpha \le a \;\;\text{(regularization)}
$$
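The LP above is small enough to prototype with an off-the-shelf solver. The sketch below is not the talk's CPLEX implementation; it is a minimal scipy-based illustration, and the helper name tolerant_svr_lp, the toy data, and the parameter values are assumptions made here.

```python
# Minimal sketch (not the talk's CPLEX code) of the tolerant-regression LP above,
# using scipy.optimize.linprog.  Variable ordering: x = [alpha (m), b, s (m), eps, a (m)],
# i.e. 3m + 2 variables.
import numpy as np
from scipy.optimize import linprog

def tolerant_svr_lp(K, y, C=1.0, mu=0.5):
    """min (1/m) e'a + (C/m) e's - C*mu*eps
       s.t. -s <= K alpha + b e - y <= s,  eps*e <= s,  -a <= alpha <= a."""
    m = len(y)
    e = np.ones((m, 1))
    I, Z, z = np.eye(m), np.zeros((m, m)), np.zeros((m, 1))

    # objective coefficients over x = [alpha, b, s, eps, a]
    c = np.concatenate([np.zeros(m), [0.0],
                        (C / m) * np.ones(m), [-C * mu],
                        (1.0 / m) * np.ones(m)])

    A_ub = np.vstack([
        np.hstack([ K,  e, -I,  z,  Z]),   #  K alpha + b e - s <= y
        np.hstack([-K, -e, -I,  z,  Z]),   # -K alpha - b e - s <= -y
        np.hstack([ Z,  z, -I,  e,  Z]),   #  eps e - s <= 0
        np.hstack([ I,  z,  Z,  z, -I]),   #  alpha - a <= 0
        np.hstack([-I,  z,  Z,  z, -I]),   # -alpha - a <= 0
    ])
    b_ub = np.concatenate([y, -y, np.zeros(3 * m)])

    bounds = [(None, None)] * (m + 1) + [(0, None)] * (2 * m + 1)  # alpha, b free; s, eps, a >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:m], res.x[m], res       # alpha, b, full solver result

# Toy usage with a linear kernel K(A, A') = A A'
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 3))
y = A @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.standard_normal(40)
alpha, b, res = tolerant_svr_lp(A @ A.T, y, C=10.0, mu=0.5)
print("mean |residual|:", np.abs(A @ A.T @ alpha + b - y).mean())
```

The linear kernel in the toy run is only for convenience; any kernel matrix K(A,A') can be passed in its place.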
Equivalent to Smola, Schölkopf, Rätsch (SSR) Formulation

Our formulation (single error bound s; tolerance enters as a constraint):
$$
\min_{(\alpha,b,s,\varepsilon,a)} \; \tfrac{1}{m}\, e'a + \tfrac{C}{m}\, e's - C\mu\varepsilon
\quad \text{s.t.}\quad
-s \le K(A,A')\alpha + be - y \le s, \quad
e\varepsilon \le s, \quad
-a \le \alpha \le a
$$

Smola, Schölkopf, Rätsch (multiple error bounds ξ₁, ξ₂):
$$
\min_{(\alpha_1,\alpha_2,b,\xi_1,\xi_2,\varepsilon)} \;
\tfrac{1}{m}\, e'(\alpha_1 + \alpha_2) + \tfrac{C}{m}\, e'(\xi_1 + \xi_2) + C(1-\mu)\varepsilon
\quad \text{s.t.}\quad
-\xi_2 - e\varepsilon \le K(A,A')(\alpha_1 - \alpha_2) + be - y \le \xi_1 + e\varepsilon, \quad
\alpha_1, \alpha_2, \xi_1, \xi_2 \ge 0
$$

Reduction in:
– Variables: 4m+2 → 3m+2
– Solution time
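For reference, the variable counts quoted above simply tally the decision variables in each LP:
$$
\text{SSR: } \underbrace{\alpha_1, \alpha_2, \xi_1, \xi_2 \in \mathbb{R}^m}_{4m} + \underbrace{b, \varepsilon \in \mathbb{R}}_{2} = 4m+2,
\qquad
\text{Ours: } \underbrace{\alpha, s, a \in \mathbb{R}^m}_{3m} + \underbrace{b, \varepsilon \in \mathbb{R}}_{2} = 3m+2.
$$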
Natural interpretation for μ

μ = 0: our linear program is equivalent to the classical stabilized least 1-norm approximation problem
$$
\min_{(\alpha,b)} \; \|\alpha\|_1 + C\,\|K(A,A')\alpha + be - y\|_1
$$

Perturbation theory results show there exists a fixed μ̄ ∈ (0, 1] such that for all μ ∈ (0, μ̄]:
– we solve the above stabilized least 1-norm problem
– additionally we maximize ε, the least error component

As μ goes from 0 to 1, the least error component ε is a monotonically nondecreasing function of μ.
Numerical Testing

Two sets of tests:
– Compare computational times of our method (MM) and the SSR method
– Row-column chunking for massive datasets

Datasets:
– US Census Bureau Adult Dataset: 300,000 points in R^11
– Delve Comp-Activ Dataset: 8192 points in R^13
– UCI Boston Housing Dataset: 506 points in R^13
– Gaussian noise was added to each of these datasets.

Hardware: Locop2, a Dell PowerEdge 6300 server with:
– Four gigabytes of memory, 36 gigabytes of disk space
– Windows NT Server 4.0
– CPLEX 6.5 solver
Experimental Process

μ is a parameter which needs to be determined experimentally.
Use a hold-out tuning set to determine the optimal value for μ.

Algorithm:
    μ = 0
    while (tuning set accuracy continues to improve)
    {
        Solve LP
        μ = μ + 0.1
    }

Run for both our method and the SSR method and compare times.
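A minimal code sketch of this tuning loop follows. It reuses the hypothetical tolerant_svr_lp helper from the earlier sketch; the mean-absolute-error tuning metric and the train/tuning kernel interface are assumptions, and only the 0.1 step comes from the slide.

```python
# Sketch of the mu-tuning loop above.  Assumes the (hypothetical) tolerant_svr_lp helper,
# a training kernel K_train = K(A_train, A_train') and a tuning kernel
# K_tune = K(A_tune, A_train'); the error metric is an assumption.
import numpy as np

def tune_mu(K_train, y_train, K_tune, y_tune, C=10.0, step=0.1):
    best_err, best_mu, mu = np.inf, 0.0, 0.0
    while mu <= 1.0:
        alpha, b, _ = tolerant_svr_lp(K_train, y_train, C=C, mu=mu)   # solve LP
        err = np.abs(K_tune @ alpha + b - y_tune).mean()              # tuning-set error
        if err >= best_err:           # stop once tuning accuracy no longer improves
            break
        best_err, best_mu = err, mu
        mu += step                    # mu = mu + 0.1
    return best_mu, best_err
```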
Comparison Results

                         μ = 0     μ = 0.1   ...      Total
Census
  Tuning set error       5.10%     4.74%     ...
  ε                      0.00      0.02      ...
  SSR time (sec)         980       935       ...      5086
  MM time (sec)          199       294       ...      3765
  Time improvement       Max: 79.7%          Avg: 26.0%

CompActiv
  Tuning set error       6.60%     6.32%     ...
  ε                      0.00      3.09      ...
  SSR time (sec)         1364      1286      ...      7604
  MM time (sec)          468       660       ...      6533
  Time improvement       Max: 65.7%          Avg: 14.1%

Boston Housing
  Tuning set error       14.69%    14.62%    ...
  ε                      0.00      0.42      ...
  SSR time (sec)         36        34        ...      170
  MM time (sec)          17        23        ...      140
  Time improvement       Max: 52.0%          Avg: 17.6%

(μ runs from 0 to 0.7 in steps of 0.1; columns for μ ≥ 0.2 not shown.)
Linear Programming Row Chunking

Basic approach (PSB/OLM) for classification problems:
– The classification problem is solved for a subset, or chunk, of constraints (data points).
– Those constraints with positive multipliers are preserved and integrated into the next chunk (support vectors).
– The objective function is monotonically nondecreasing.
– The dataset is repeatedly scanned until the objective function stops increasing.
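A minimal sketch of that loop, with the LP solve abstracted away. The solve_lp callback, the chunk size, and the improvement tolerance are assumptions made for illustration.

```python
# Sketch of the row-chunking loop described above.  solve_lp(rows) stands in for
# "solve the LP restricted to these constraints" and is assumed to return the objective
# value and the dual multipliers aligned with the rows passed in (not shown here).
import numpy as np

def row_chunking(num_rows, solve_lp, chunk_size=1000, tol=1e-6):
    order = np.random.permutation(num_rows)     # scan order over the dataset
    carried = np.array([], dtype=int)           # support-vector constraints carried forward
    prev_obj = -np.inf
    while True:                                 # repeatedly scan the dataset
        for start in range(0, num_rows, chunk_size):
            chunk = np.union1d(carried, order[start:start + chunk_size])
            obj, multipliers = solve_lp(chunk)      # objective is monotonically nondecreasing
            carried = chunk[multipliers > 0]        # keep constraints with positive multipliers
        if obj - prev_obj <= tol:               # stop when a full scan no longer improves
            break
        prev_obj = obj
    return carried, obj
```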
Innovation: Simultaneous Row-Column Chunking

Mapping of data points to constraints
– Classification: each data point yields one constraint.
– Regression: each data point yields two constraints. Row-column chunking manages which constraint to maintain for the next chunk.

Fixing dual variables at upper bounds for efficiency
– Classification:
  • Simple to do since the problem is coded in its dual formulation.
  • Any support vectors with dual variables at the upper bound are held constant in successive chunks.
– Regression:
  • The primal formulation was used for efficiency purposes.
  • We therefore aggregated all constraints with fixed multipliers to yield a single constraint.
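One way to read that last bullet (the talk does not spell out the algebra, so this is an assumption about the aggregation used): constraints whose multipliers are fixed at known values can be replaced by their multiplier-weighted sum, which contributes the same term to the objective at those multiplier values while occupying only one row of the LP:
$$
a_i' x \le b_i \;\; (i \in F), \;\text{ multipliers fixed at } \bar u_i \ge 0
\quad\Longrightarrow\quad
\sum_{i \in F} \bar u_i\, a_i' x \;\le\; \sum_{i \in F} \bar u_i\, b_i .
$$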
Innovation: Simultaneous Row-Column Chunking

Large number of columns
– Row Chunking
  • Implemented for a linear kernel only.
  • Cannot handle problems with large numbers of variables, and hence is limited practically to linear kernels.
– Row-Column Chunking
  • Implemented for a general nonlinear kernel.
  • New data increase the dimensionality of K(A,A') by adding both rows and columns (variables) to the problem.
  • We handle this with row-column chunking.
Row-Column Chunking Algorithm

while (problem termination criteria not satisfied)
{
    choose a set of rows from the problem as a row chunk
    while (row chunk termination criteria not satisfied)
    {
        from this row chunk, select a set of columns
        solve the LP allowing only these columns as variables
        add those columns with nonzero values to the next column chunk
    }
    add those rows with nonzero dual multipliers to the next row chunk
}
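The sketch below mirrors this pseudocode. The solve_restricted_lp callback, the chunk sizes (assumed no larger than the problem dimensions), and the improvement-based termination tests are all assumptions made here, not the talk's actual criteria.

```python
# Sketch of the row-column chunking loop above.  solve_restricted_lp(rows, cols) is assumed
# to solve the LP restricted to those constraint rows and columns (variables) and to return
# the objective, the primal values of those columns, and the dual multipliers of those rows.
import numpy as np

def row_column_chunking(num_rows, num_cols, solve_restricted_lp,
                        row_chunk=1000, col_chunk=1000, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    kept_rows = np.array([], dtype=int)
    kept_cols = np.array([], dtype=int)
    prev_obj = -np.inf
    while True:                                                  # outer loop over row chunks
        rows = np.union1d(kept_rows, rng.choice(num_rows, row_chunk, replace=False))
        inner_prev = -np.inf
        while True:                                              # inner loop over column chunks
            cols = np.union1d(kept_cols, rng.choice(num_cols, col_chunk, replace=False))
            obj, primal, duals = solve_restricted_lp(rows, cols)
            kept_cols = cols[np.abs(primal) > 0]                 # columns with nonzero values
            if obj - inner_prev <= tol:                          # row-chunk termination criterion
                break
            inner_prev = obj
        kept_rows = rows[np.abs(duals) > 0]                      # rows with nonzero dual multipliers
        if obj - prev_obj <= tol:                                # problem termination criterion
            break
        prev_obj = obj
    return kept_rows, kept_cols, obj
```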
Row-Column Chunking Diagram

[Diagram: the chunking process as steps 1a-1c, 2a-2c, 3a-3c, with a loop over column chunks inside each row chunk.]
Chunking Experimental Results

Dataset:                 16,000 point subset of Census in R^11 + noise
Kernel:                  Gaussian radial basis kernel
LP size:                 32,000 nonsparse rows and columns
Problem size:            1.024 billion nonzero values
Time to termination:     18.8 days
Number of SVs:           1621 support vectors
Solution variables:      33 nonzero components
Final tuning set error:  9.8%
Tuning set error on first chunk (1000 points): 16.2%
Objective Value & Tuning Set Error
for Billion-Element Matrix

[Figure: two panels, Objective Value (0 to 25,000) and Tuning Set Error (8% to 20%), each plotted against Row-Column Chunk Iteration Number (0 to 2000) and Time in Days (0 to 18).]
Conclusions and Future Work


Conclusions
– Support Vector Regression can be handled more
efficiently using improvements on previous
formulations
– Row-column chunking is a new approach which
can handle massive regression problems
Future work
– Generalizing to other loss functions, such as
Huber M-estimator
– Extension to larger problems using parallel
processing for both linear and quadratic
programming formulations
Questions?
LP Perturbation Regime #1

Our LP is given by:
$$
\min_{(\alpha,b,s,\varepsilon,a)} \; \tfrac{1}{m}\, e'a + \tfrac{C}{m}\, e's - C\mu\varepsilon
\quad \text{s.t.}\quad
-s \le K(A,A')\alpha + be - y \le s, \quad
e\varepsilon \le s, \quad
-a \le \alpha \le a
$$

When μ = 0, the solution is the stabilized least 1-norm solution.
Therefore, by LP perturbation theory, there exists a μ̄ ∈ (0, 1] such that:
– the solution to the LP with μ ∈ (0, μ̄] is a solution to the least 1-norm problem that also maximizes ε.
LP Perturbation Regime #2

Our LP can be rewritten as:
$$
\min_{(\alpha,b,\varepsilon,d)} \;
\tfrac{1}{m}\|\alpha\|_1 + \tfrac{C}{m}\, e'\!\left(|d| - e\varepsilon\right)_+ + C(1-\mu)\varepsilon
\quad \text{s.t.}\quad
d = K(A,A')\alpha + be - y
$$

Similarly, by LP perturbation theory, there exists a μ̄ ∈ [0, 1) such that:
– the solution to the LP with μ ∈ [μ̄, 1) is the solution that minimizes the least error (ε) among all minimizers of the average tolerated error.
Motivation for dual variable substitution

Primal:
$$
\min_{(w,b,s)} \; \nu e's + \tfrac{1}{2}\|w\|^2
\quad \text{s.t.}\quad
-s \le Aw + be - y \le s
$$

Dual:
$$
\min_{(u,v)} \; \tfrac{1}{2}\|A'(u-v)\|^2 + y'(u-v)
\quad \text{s.t.}\quad
e'u = e'v, \quad u + v = \nu e, \quad u, v \ge 0
$$

$$
w = A'\alpha = A'(u - v)
$$