Response to Editors and Referees for their Review of "A General Framework for Fast Stagewise Algorithms", Round 2

Ryan J. Tibshirani
Carnegie Mellon University
[email protected]

I would like to thank the Associate Editor and Referees for another round of helpful reviews of my paper. I have edited the paper per your suggestions, and I believe the result is yet another improved manuscript. Below is a point-by-point response to the comments raised by the referees.

Referee 1

Thank you for another encouraging and helpful review of my paper.

A technical comment: In Section 3.5, the author presents a specification of the stagewise algorithm to the generalized lasso problem in Equation (37). Since the stagewise update in Equation (6) with constraint ... In my opinion, it would be more rigorous to acknowledge that the constrained problem (40) is in fact a dual problem of the following generalized lasso regularization problem ...

You're right; what you suggest is clearer. I have revised this section accordingly.

Continuing the above comment: based on the maximizing argument of the convex conjugate and the primal-dual relationship (39), it holds that ... This implies that the stagewise algorithm can be applied to the dual problem (40) without explicitly evaluating the convex conjugate function and its gradient ...

This is an excellent point! As you say, this completely eliminates the need to consider the convex conjugate function f* of f at all, and therefore all one needs to do is "invert" the primal-dual relationship for β^(k), at each step k. I have rewritten the section to explain this, and acknowledged your insights accordingly.
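For concreteness, here is the identity at work, in generic notation. This is a sketch only: equations (37)-(40) are not reproduced in this letter, so I write the dual of the generalized lasso in its standard form, a constrained minimization of f*(-D^T u); the paper's exact parametrization of (39) and (40) may differ in signs and constants. By the maximizing-argument property of the convex conjugate,

\[
\nabla_u \, f^*(-D^T u^{(k)}) \;=\; -D \, \nabla f^*(-D^T u^{(k)}) \;=\; -D \beta^{(k)},
\qquad
\beta^{(k)} \;=\; \operatorname*{argmax}_{\beta} \, \Big\{ -\big\langle D^T u^{(k)}, \beta \big\rangle - f(\beta) \Big\},
\]

so the gradient of the dual objective at u^(k) is available as soon as β^(k), the primal point tied to u^(k) by the primal-dual relationship (39), has been recovered; the conjugate f* itself is never evaluated.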
Referee 2

Thank you for another constructive and thorough reading of my paper.

Comparison. I partially agree with the author's discussion of the papers Friedman (2008) and Zhao and Yu (2007). I agree that the algorithm in Friedman (2008) is not applicable to very general regularization settings such as trace norm regularization and the fused lasso, but it is still applicable to the group lasso and ridge logistic regression. I also agree that the algorithm in Zhao and Yu (2007) may require updating the estimate by minimizing the penalized loss function directly. But the minimization is taken in a coordinate-wise manner, and I don't think it is intractable for at least the regularization problems of the group lasso, ridge logistic regression, and the fused lasso (one can compute 2p different values of the penalized loss function and then identify the smallest one). For trace norm regularization, the update of the algorithm in Zhao and Yu (2007) may be computationally heavy, since it requires computing the trace norm or the subgradient of the trace norm. Therefore, I still want to see some comparisons of these algorithms in terms of computational time and statistical performance for the regularization problems to which they are applicable.

It is not clear how to use Friedman (2008) for the fused lasso or matrix completion problems. The penalty cannot be written in terms of the absolute individual components |βi|, and so Friedman's path update cannot be readily performed. There is a deeper issue lurking here: Friedman (2008) is specific to a class of "basis" estimation problems, where the coefficients β index the basis functions. Problems like the fused lasso (say, over a 2d grid or a graph) and matrix completion don't admit a "basis" representation, and so this doesn't make sense. Zhao and Yu (2007), as you explain, is computationally intractable for the matrix completion problem. For fused lasso problems, we run into a similar issue as the one I explained above for Friedman (2008): it doesn't make sense to take individual steps on components of β here, because the penalty doesn't "separate" into terms involving |βi|.

Friedman (2008) and Zhao and Yu (2007) can be applied to the ridge logistic regression and group lasso problems. An important note: I have not been able to find any software for Zhao and Yu (2007), and the software provided by Friedman (2008) only considers (generalized) elastic net penalties. Furthermore, Friedman's software is entangled with other tasks like model selection via cross-validation, and this slows down the implementation. So I have implemented both Zhao and Yu's and Friedman's methods myself, for the ridge logistic regression problem. (Note: I think it is preferable to look at ridge logistic regression rather than the group lasso, since the latter is really quite close in spirit to the lasso, and Friedman (2008) and Zhao and Yu (2007) are essentially designed for sparse estimation problems, so I would expect this comparison to be similar to the comparisons run for the lasso case, which have already been done. In summary, I would expect that their methods perform fine for the group lasso, though they would still be hindered a bit, since they would only ever move single components at a time, rather than groups of components.)

Hence, see Figure 1 for a comparison of the three methods: stagewise, GPS (for generalized path seeking, Friedman's algorithm), and ZY (for Zhao and Yu). All implementations were written in R. The setup is the same as the ridge logistic regression setup considered in the paper (Appendix A.4, uncorrelated predictors case). Per iteration, stagewise and GPS have very similar computational costs. ZY, on the other hand, is significantly more expensive per iteration. This is because it needs to scan over 2p candidate possibilities each time it makes an update, and for each possibility, it needs to fully evaluate the regularized logistic loss (a sketch of this update is given at the end of this discussion). There may be a clever way to leverage the similarity between these loss evaluations, but by the same token, there may be clever ways to leverage the similarity in stagewise/GPS updates across steps, and since we are looking at the most basic implementation of each method, these clever tricks were not pursued. Due to its computational cost, ZY was only run with a moderate step size (so that the number of steps could be kept relatively small). In addition to this increased iteration cost, ZY also suffers computationally from its "backward" steps, which don't progress the amount of regularization in the estimate at all. E.g., for the above example, nearly the last 35 steps (of 80 total) are spent on backward steps.

As for the statistical comparison, we can see that the test error versus ℓ2 norm tradeoff for GPS is much less favorable than that for the exact ridge solution path and stagewise (for the latter two, there is a dip, and the minimum test error occurs at an ℓ2 norm of about 5, but the magnitude of the difference between these errors and those from GPS washes away its shape). On one hand, this difference should not be too surprising, since GPS is not designed to navigate the tradeoff between test error and ℓ2 norm; it is instead designed to navigate the tradeoff between test error and some metric of sparsity. On the other hand, the sheer difference between its performance and that of exact ridge logistic regression and stagewise is somewhat striking.
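To make the per-iteration cost of ZY concrete, below is a minimal sketch in R of the kind of coordinate update just described. This is illustrative only: the names (zy_step, pen_loss, lambda, eps) are mine, and the actual Zhao-Yu procedure distinguishes forward and backward steps via a tolerance ξ, which this sketch collapses into a single scan over the 2p candidate moves (as in the referee's description above).

  # One coordinate update in the style of Zhao & Yu (2007), for ridge
  # logistic regression: scan all 2p single-coordinate moves of size eps,
  # and take the one giving the smallest penalized loss.
  zy_step <- function(beta, X, y, lambda, eps) {
    pen_loss <- function(b) {
      eta <- as.vector(X %*% b)
      # logistic loss (y in {0,1}) plus ridge penalty
      sum(log(1 + exp(eta)) - y * eta) + lambda * sum(b^2) / 2
    }
    p <- length(beta)
    best_val <- Inf; best_j <- NA; best_s <- NA
    for (j in 1:p) {            # p coordinates ...
      for (s in c(-eps, eps)) { # ... each moved by -eps or +eps: 2p candidates
        cand <- beta
        cand[j] <- cand[j] + s
        val <- pen_loss(cand)   # a full pass over the data per candidate
        if (val < best_val) { best_val <- val; best_j <- j; best_s <- s }
      }
    }
    beta[best_j] <- beta[best_j] + best_s
    beta
  }

Each update thus costs 2p full evaluations of the regularized logistic loss, which is the source of the per-iteration gap in the timings of Figure 1: stagewise and GPS need only on the order of one gradient evaluation per step.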
[Figure 1, top panel: test misclassification rate versus ℓ2 norm ("Uncorrelated predictors"); curves shown for Exact, Stagewise (ε = 0.0025), Stagewise (ε = 0.25), GPS (ε = 0.25), GPS (ε = 1), and ZY (ε = 0.25).]

Algorithm timings:

  Method                                      Time (seconds)
  Exact: coordinate descent, 100 solutions             9.493
  Stagewise: ε = 0.0025, 150 estimates                 7.398
  Stagewise: ε = 0.25, 15 estimates                    0.688
  GPS: ε = 0.25, 200 estimates                         8.999
  GPS: ε = 1, 50 estimates                             2.193
  ZY: ε = 0.25, ξ = 0.0625, 80 estimates             815.381

Figure 1: More ridge logistic regression comparisons. ZY manages to achieve a test error comparable to that of the exact ridge logistic regression solution and stagewise, but it requires a substantially higher computational cost (and substantially more iterations) to get there.

Paper length. Though the paper is shorter than the first version, it is still long (40 pages of main text plus 16 pages of appendix). It would be better to cut another 5 to 10 pages by making the discussions more concise.

I have further shortened the paper down to 35.5 pages of main text.

The current version is better organized than the first version, but I think it can still be improved. For instance, Section 3 is on the applications of the general stagewise algorithm to different regularization problems (one problem in each subsection). But Section 3.6 is not an independent application problem; instead, it discusses a special case of Section 3.5. It may be better to merge these two sections into one. Also, the title of Section 2 seems not to be very accurate, since this section contains not only the motivation of the general stagewise algorithm, but also some basic properties and related work. How about changing the title of this section to, e.g., "A general stagewise framework"?

Thanks for these helpful comments. I have merged the two sections and changed the title of Section 2, as you suggested.

Choice of the step size ε. I apologize for the misunderstanding of the heuristic guideline for the choice of ε. Now I'm clear that the computation of the heuristic method is not an issue. But how does this heuristic method perform (statistically) in the regularization problems in Section 3?

I am glad the heuristic method is clear now. A formal (theoretical) study of what the heuristic method implies, statistically, for the regularization problems considered in Section 3 is beyond the scope of the current paper. Indeed, as I have noted in the paper, the formal statistical properties of stagewise estimates, even under an "oracle" choice of step size (say, limiting step sizes that go to zero at a proper rate), are not well understood, and these too are beyond the scope of the current paper.
I am definitely planning to pursue these problems, and have made some progress in these directions, but this will constitute another project. However, one perhaps obvious point, as I mention in the paper, is that cross-validation can be used to evaluate the statistical aspects of any given choice of step size.

As for what the heuristic method has to say about the choices of ε in the big simulations of Section 4: each considered choice of ε here "passes" the heuristic test. E.g., for the group lasso case, Figure 2 plots the heuristic metrics (decrease of f, and increase of g). We can see that f and g behave monotonically across the step sizes, both when ε = 1 and ε = 10 (though for ε = 1 the stagewise method is able to eventually achieve a slightly lower loss f, and for ε = 10, a slightly higher regularizer g). Because f and g do not wiggle back and forth, the heuristic method would have been fine with either choice of ε. That is, if started at ε = 10, it would have kept ε = 10 throughout the path, and if started at ε = 1, it would have kept ε = 1 throughout the path. (A small sketch of this monotonicity check is given after Figure 2.)

[Figure 2 panels: the loss f(x^(k)) and the regularizer g(x^(k)) plotted against the step number k, for ε = 1 and ε = 10.]

Figure 2: Heuristic metrics for the group lasso example.
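For completeness, here is a small sketch in R of the monotonicity check underlying this heuristic. It is illustrative only: heuristic_ok, f_vals, and g_vals are my names, with f_vals and g_vals assumed to record f(x^(k)) and g(x^(k)) along a stagewise path produced by the user's own implementation.

  # Heuristic check on a candidate step size: along the stagewise path,
  # the loss f should only decrease and the regularizer g only increase;
  # "wiggling back and forth" signals that the step size is too large.
  heuristic_ok <- function(f_vals, g_vals, tol = 1e-8) {
    all(diff(f_vals) <= tol) &&   # f(x^(k)) non-increasing (up to tolerance)
      all(diff(g_vals) >= -tol)   # g(x^(k)) non-decreasing (up to tolerance)
  }

Under this check, both ε = 1 and ε = 10 would be retained for the entire group lasso path, consistent with the behavior shown in Figure 2.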