Response to Editors and Referees for their Review of “A
General Framework for Fast Stagewise Algorithms”, Round 2
Ryan J. Tibshirani
Carnegie Mellon University
[email protected]
I would like to thank the Associate Editor and Referees for another round of helpful reviews of
my paper. I have edited the paper per your suggestions. I believe the result is yet another improved
manuscript. Below is a point-by-point response to the comments raised by the referees.
Referee 1
Thank you for another encouraging and helpful review of my paper.
A technical comment: In Section 3.5, the author presents a specification of the stagewise
algorithm to the generalized lasso problem in Equation (37). Since the stagewise update
in Equation (6) with constraint ... In my opinion, it will be more rigorous to acknowledge
that the constrained problem (40) is in fact a dual problem of the following generalized
lasso regularization problem ...
You’re right; what you suggest is clearer. I have revised this section accordingly.
Continuing the above comment: based on the maximizing argument of the convex conjugate
and the primal-dual relationship (39), it holds that ... This implies that the stagewise
algorithm can be applied to the dual problem (40) without explicitly evaluating the convex
conjugate function and its gradient ...
This is an excellent point! As you say, this completely eliminates the need to consider the convex
conjugate function f ∗ of f at all; all one needs to do is “invert” the primal-dual relationship for
β^(k), at each step k. I have rewritten the section to explain this, and acknowledged your insight
accordingly.
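For concreteness, the identity at play here is standard convex analysis, written below in generic notation since the equation numbers above refer to the paper: when f is differentiable and strictly convex (so the maximizer is unique),
\[
\nabla f^*(u) \;=\; \operatorname*{argmax}_{\beta} \big\{ u^T \beta - f(\beta) \big\},
\]
so the gradient of f ∗ at the current dual iterate can be read off the maximizing β^(k) delivered by the primal-dual relationship, and f ∗ itself never needs to be formed.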
Referee 2
Thank you for another constructive and thorough reading of my paper.
Comparison. I partially agree with the author’s discussion of the papers Friedman (2008)
and Zhao and Yu (2007). I agree that the algorithm in Friedman (2008) is not applicable
to very general regularization settings such as trace norm regularization and the fused lasso,
but it is still applicable to the group lasso and ridge logistic regression. I also agree that the
algorithm in Zhao and Yu (2007) may require updating the estimate by minimizing the
penalized loss function directly. But the minimization is taken in a coordinate-wise manner
and I don’t think it is intractable for at least the regularization problems of the group lasso,
ridge logistic regression, and the fused lasso (one can compute 2p different values of the penalized
loss function and then identify the smallest one). For trace norm regularization, the
update of the algorithm in Zhao and Yu (2007) may be computationally heavy since it
requires computing the trace norm or the subgradient of the trace norm. Therefore, I
still want to see some comparisons of these algorithms in terms of computational time
and statistical performance for the regularization problems that they are applicable to.
It is not clear how to use Friedman (2008) for the fused lasso or matrix completion problems. The
penalty cannot be written in terms of the absolute values of the individual components |βi|, and so
Friedman’s path update cannot be readily performed. There is a deeper issue lurking here: Friedman
(2008) is specific to a class of “basis” estimation problems, where the coefficients β index the basis
functions. Problems like the fused lasso (say, over a 2d grid or a graph) and matrix completion don’t
admit such a “basis” representation, and so this framework does not apply.
Zhao and Yu (2007), as you explain, is computationally intractable for the matrix completion
problem. For fused lasso problems, we run into an issue similar to the one explained above for
Friedman (2008): it doesn’t make sense to take individual steps on components of β here, because
the penalty doesn’t “separate” into terms involving |βi|.
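To make the non-separability concrete (generic notation, not the paper’s equation numbering): the lasso penalty decomposes coordinate-wise, whereas the fused lasso penalty over a graph with edge set E couples pairs of coordinates,
\[
\sum_{i=1}^p |\beta_i|
\qquad \text{versus} \qquad
\sum_{(i,j) \in E} |\beta_i - \beta_j|,
\]
so an update rule phrased in terms of the individual quantities |βi| has no direct analogue for the latter.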
Friedman (2008) and Zhao and Yu (2007) can be applied to the ridge logistic regression and
group lasso problems. An important note: I have not been able to find any software for Zhao and
Yu (2007), and the software provided by Friedman (2008) only considers (generalized) elastic net
penalties. Furthermore, Friedman’s software is entangled with other tasks like model selection via
cross-validation and this will slow down the implementation. So I’ve implemented both Zhao and
Yu’s and Friedman’s methods myself, for the ridge logistic regression problem.
(Note: I think it is preferable to look at ridge logistic regression rather than the group lasso, since
the latter is really quite close in spirit to the lasso, and Friedman (2008) and Zhao and Yu (2007)
are essentially designed for sparse estimation problems; I would expect this comparison to be similar
to the comparisons already run for the lasso case. In summary, I would expect their methods to
perform fine for the group lasso, though they would still be hindered a bit, since they would only
ever move single components at a time, rather than groups of components.)
See Figure 1 for a comparison of the three methods: stagewise, GPS (for generalized path
seeking, Friedman’s algorithm), and ZY (for Zhao and Yu). All implementations were written in R.
The setup is the same as the ridge logistic regression setup considered in the paper (Appendix
A.4, uncorrelated predictors case).
Per iteration, stagewise and GPS have very similar computational costs. ZY, on the other hand, is
significantly more expensive per iteration. This is because it needs to scan over 2p candidate
possibilities each time it makes an update, and for each possibility it needs to fully evaluate the
regularized logistic loss. There may be a clever way to leverage the similarity between these loss
evaluations, but by the same token, there may be clever ways to leverage the similarity in
stagewise/GPS updates across steps; since we are looking at the most basic implementation of each
method, such tricks were not pursued. Due to its computational cost, ZY was only run with a
moderate step size (so that the number of steps could be kept relatively small). In addition to this
increased iteration cost, ZY also suffers computationally from its “backward” steps, which don’t
advance the amount of regularization in the estimate at all. E.g., in the above example, roughly the
last 35 steps (of 80 total) are spent on backward steps.
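To illustrate where the factor of 2p in ZY’s per-iteration cost comes from, below is a minimal Python sketch of one forward step of such a coordinate-wise scan. This is not my R implementation; it omits ZY’s backward steps and ξ-tolerance rule, and the function name and the exact objective shown are purely illustrative.

    import numpy as np

    def zy_style_forward_step(beta, X, y, lam, eps):
        """Try +/- eps on each of the p coordinates (2p candidates), fully
        re-evaluate the ridge-penalized logistic loss for each candidate,
        and keep the best one."""
        def penalized_loss(b):
            z = X @ b
            # logistic negative log-likelihood plus ridge penalty (illustrative)
            return np.sum(np.log(1 + np.exp(z)) - y * z) + lam * np.sum(b ** 2)

        best_beta, best_loss = beta, penalized_loss(beta)
        for j in range(len(beta)):
            for sign in (1.0, -1.0):
                cand = beta.copy()
                cand[j] += sign * eps
                loss = penalized_loss(cand)  # one full loss evaluation per candidate
                if loss < best_loss:
                    best_beta, best_loss = cand, loss
        return best_beta

The point is simply that each update costs 2p full evaluations of the penalized loss, whereas a stagewise or GPS update requires only a single gradient computation.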
As for the statistical comparison, we can see that the test error vs. ℓ2 norm tradeoff for GPS is
much less favorable than that for the exact ridge solution path and stagewise (for the latter two,
there is a dip, with the minimum test error occurring at an ℓ2 norm of about 5, but the magnitude
of the difference between these errors and those from GPS washes out the shape of this dip in the
plot). On one hand, this difference should not be too surprising, since GPS is not designed to
navigate the tradeoff between test error and ℓ2 norm; it is instead designed to navigate the tradeoff
between test error and some metric of sparsity. On the other hand, the sheer difference between its
performance and that of exact ridge logistic regression and stagewise is somewhat striking.
[Figure 1, plot: “Uncorrelated predictors”; test misclassification rate versus ℓ2 norm (0 to 6), with curves for Exact, Stagewise (eps=0.0025 and eps=0.25), GPS (eps=0.25 and eps=1), and ZY (eps=0.25).]
Algorithm timings

Method                                       Time (seconds)
Exact: coordinate descent, 100 solutions          9.493
Stagewise: ε = 0.0025, 150 estimates              7.398
Stagewise: ε = 0.25, 15 estimates                 0.688
GPS: ε = 0.25, 200 estimates                      8.999
GPS: ε = 1, 50 estimates                          2.193
ZY: ε = 0.25, ξ = 0.0625, 80 estimates          815.381
Figure 1: More ridge logistic regression comparisons.
ZY manages to achieve a test error comparable to that of the exact ridge logistic regression
solution and of stagewise, but it requires a substantially higher computational cost (and substantially
more iterations) to get there.
Paper length. Though the paper is shorter than the first version, it is still long (40
pages of main text plus a 16-page Appendix). It would be better to cut another 5 to 10 pages
by making the discussions more concise.
I have further shortened the paper down to 35.5 pages of main text.
The current version is better organized than the first version, but I think it can still be
improved. For instance, Section 3 is on the applications of the general stagewise algorithm
to different regularization problems (one problem in each subsection). But Section 3.6
is not an independent application problem; instead, it discusses a special case of Section
3.5. It may be better to merge these two sections into one. Also, the title of Section 2
seems not to be very accurate, since this section contains not only the motivation for the
general stagewise algorithm, but also some basic properties and related work. How about
changing the title of this section to, e.g., “A general stagewise framework”?
Thank you for these helpful comments. I have merged the two sections and changed the title of
Section 2, as you suggested.
Choice of the step size ε. I apologize for the misunderstanding of the heuristic guideline
for the choice of ε. Now I’m clear that the computation of the heuristic method is not an
issue. But how does this heuristic method perform (in a statistical sense) on the regularization
problems in Section 3?
I am glad the heuristic method is clear now. A formal (theoretical) study of what the heuristic
method implies, statistically, for the regularization problems considered in Section 3 is beyond the
scope of the current paper. Indeed, as I note in the paper, the formal statistical properties
of stagewise estimates, even under an “oracle” choice of step size (say, step sizes that go to zero at
a proper rate), are not well-understood, and are also beyond the scope of the current paper. I am
definitely planning to pursue these problems, and have made some progress in these directions, but
this will constitute another project. However, one perhaps obvious point, as I mention in the paper,
is that cross-validation can be used to evaluate the statistical properties of any given choice of
step size.
As for what the heuristic method has to say about the choices of ε in the big simulations of
Section 4: each considered choice of ε here “passes” the heuristic test. E.g., for the group lasso case,
Figure 2 plots the heuristic metrics (decrease of f , and increase of g).
We can see that f and g behave monotonically across the step sizes, both when ε = 1 and ε = 10
(though for ε = 1 the stagewise method is able to eventually achieve a slightly lower loss f , and for
ε = 10, a slightly higher regularizer g). Because f and g do not wiggle back and forth, the heuristic
method would have been fine with either choice of ε. That is, if started at ε = 10, it would have
kept ε = 10 throughout the path, and if started at ε = 1, it would have kept ε = 1 throughout the
path.
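Schematically, the kind of check being applied here can be written as below in Python. This is only a sketch of the monotonicity criterion just described, not the paper’s exact heuristic guideline; the function name and tolerance are illustrative.

    import numpy as np

    def step_size_passes(f_vals, g_vals, tol=0.0):
        """A step size "passes" if, along the stagewise path, the loss f(x_k)
        does not wiggle back up and the regularizer g(x_k) does not wiggle
        back down (i.e., both behave monotonically, up to a tolerance)."""
        f_vals = np.asarray(f_vals, dtype=float)
        g_vals = np.asarray(g_vals, dtype=float)
        f_ok = np.all(np.diff(f_vals) <= tol)    # f is (nearly) decreasing
        g_ok = np.all(np.diff(g_vals) >= -tol)   # g is (nearly) increasing
        return bool(f_ok and g_ok)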
[Figure 2, plots: f(x_k) and g(x_k) versus step number k (0 to 250), for eps=1 and eps=10.]
Figure 2: Heuristic metrics for the group lasso example.